Filter Spark DataFrame Based on Date

How do you filter a Spark DataFrame based on date? Using the filter() function, you can easily filter a DataFrame on a date column. In this article, I will explain how to filter based on a date with various examples.

1. Filter Syntax

The filter() function selects a subset of rows from a DataFrame or Dataset based on a condition. In Scala, filter is called as a method on the DataFrame or Dataset.

The basic syntax of the filter operation in Scala is as follows:

val filteredDF = originalDF.filter(condition)

where originalDF is the DataFrame you want to filter, condition is a Boolean expression that specifies the filtering condition, and filteredDF is the resulting DataFrame that contains only the rows that satisfy the filtering condition.

For example, let’s say you have a DataFrame employeesDF that contains employee data with columns “name”, “age”, and “department”. You can use the filter operation to select only the employees who are older than 30 as follows:

val filteredDF = employeesDF.filter($"age" > 30)

Here, $”age” is used to access the “age” column of the DataFrame, and > 30 is the filtering condition.

You can also combine multiple conditions using logical operators like && (AND) and || (OR). For example, to select employees who are older than 30 and work in the “sales” department, you can use:

val filteredDF = employeesDF.filter($"age" > 30 && $"department" === "sales")

Here, && is used to combine the two conditions, and === is Spark's column equality operator, used here to compare the string column against a literal.
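
Putting the pieces together, here is a minimal, self-contained sketch you can paste into spark-shell; the employee data below is hypothetical, made up purely for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("FilterExample").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data, for illustration only
val employeesDF = Seq(
  ("Alice", 35, "sales"),
  ("Bob", 28, "engineering"),
  ("Carol", 42, "sales")
).toDF("name", "age", "department")

// Employees older than 30 who work in the sales department
val filteredDF = employeesDF.filter($"age" > 30 && $"department" === "sales")
filteredDF.show()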

2. Examples of Date based filtering

You can use Spark SQL functions to filter a DataFrame based on date columns. Here are some examples of how to filter a DataFrame based on different date conditions:

2.1 Filter based on a specific date

import org.apache.spark.sql.functions._

val df = spark.read.format("csv").option("header", "true").load("path/to/file.csv")
val filteredDf = df.filter(to_date($"date_col", "yyyy-MM-dd") === lit("2022-01-01"))

In this example, we are filtering the DataFrame df based on the date_col column, which is in the “yyyy-MM-dd” format. We use the to_date function to convert the column to a date type and compare it to a specific date using the lit function.
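
As a runnable sketch that doesn't require an actual CSV file, you can apply the same filter to a hypothetical in-memory DataFrame (this assumes spark.implicits._ is in scope, as in the earlier sketch):

// Hypothetical sample data, for illustration only
val sampleDf = Seq(
  ("a", "2022-01-01"),
  ("b", "2022-01-15")
).toDF("id", "date_col")

// Parse the string column as a date and keep only the matching rows
val filteredSample = sampleDf.filter(to_date($"date_col", "yyyy-MM-dd") === lit("2022-01-01"))
filteredSample.show()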

2.2 Filter based on a date range

val startDate = "2022-01-01"
val endDate = "2022-01-31"
val filteredDf = df.filter(to_date($"date_col", "yyyy-MM-dd").between(startDate, endDate))

Here, we are filtering the DataFrame df based on the date_col column between two dates, startDate and endDate. We use the to_date function to convert the column to a date type and use the between function to specify the date range.
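
Note that between is inclusive on both endpoints. If you need a half-open range, or simply prefer explicit comparisons, an equivalent sketch (the filteredDf2 name is just for illustration) uses the >= and <= operators:

// Equivalent range filter with explicit comparison operators;
// tighten >= or <= if you need to exclude an endpoint
val dateCol = to_date($"date_col", "yyyy-MM-dd")
val filteredDf2 = df.filter(dateCol >= lit(startDate) && dateCol <= lit(endDate))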

2.3 Filter based on the current date

val filteredDf = df.filter(to_date($"date_col", "yyyy-MM-dd") === current_date())

This example filters the DataFrame df based on the date_col column, which is in the “yyyy-MM-dd” format, and compares it to the current date using the current_date function.
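
A common variant is keeping only recent rows, say everything from the last seven days. Here is a minimal sketch using date_sub; the seven-day window is an arbitrary choice for illustration:

// Keep rows whose date falls within the last 7 days, including today
val lastWeekDf = df.filter(
  to_date($"date_col", "yyyy-MM-dd") >= date_sub(current_date(), 7)
)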

2.4 Filter based on a date difference

val filteredDf = df.filter(datediff(to_date($"date_col", "yyyy-MM-dd"), current_date()) > 30)

In this example, we filter the DataFrame df based on the date_col column, which is in the “yyyy-MM-dd” format, using the datediff function. Since datediff(end, start) returns the number of days from start to end, this condition keeps the rows where date_col is more than 30 days after the current date.
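
If you instead want rows whose date is more than 30 days in the past, swap the arguments to datediff, as in this sketch:

// Keep rows whose date is more than 30 days in the past
val olderThan30 = df.filter(
  datediff(current_date(), to_date($"date_col", "yyyy-MM-dd")) > 30
)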

3. Conclusion

In conclusion, a Spark DataFrame can be filtered in Scala by a specific date, a date range, the current date, or a date difference, using the appropriate function with the desired condition.

The to_date function converts a column to a date type so that it can be compared with other dates or used with date functions.

It’s important to ensure that the date column is in a format that can be parsed by the to_date function; otherwise, to_date returns null and those rows are silently dropped from the result.

Related Articles

Spark DataFrame Tutorial with Examples

Spark RDD filter() with examples

Spark date_format() – Convert Timestamp to String

Spark SQL datediff()

Spark to_date() – Convert timestamp to date