How do you filter a Spark DataFrame based on date? By using the filter() function, you can easily filter a DataFrame based on a date column. In this article, I will explain how to filter based on a date with various examples.
1. Filter Syntax
The filter() function can be used to select a subset of data from a DataFrame or Dataset based on a condition. In Scala, you can use the filter method to apply a filter to a DataFrame or Dataset.
The basic syntax of the filter operation in Scala is as follows:
val filteredDF = originalDF.filter(condition)
where originalDF is the DataFrame you want to filter, condition is a Boolean expression that specifies the filtering condition, and filteredDF is the resulting DataFrame that contains only the rows that satisfy the filtering condition.
For example, let’s say you have a DataFrame employeesDF that contains employee data with columns “name”, “age”, and “department”. You can use the filter operation to select only the employees who are older than 30 as follows:
val filteredDF = employeesDF.filter($"age" > 30)
Here, $"age" is used to access the "age" column of the DataFrame, and > 30 is the filtering condition.
You can also combine multiple conditions using logical operators like && (AND) and || (OR). For example, to select employees who are older than 30 and work in the “sales” department, you can use:
val filteredDF = employeesDF.filter($"age" > 30 && $"department" === "sales")
Here, && is used to combine the two conditions, and === is used for equality comparison on columns (in this case, matching the string value "sales").
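Putting this together, below is a minimal, self-contained sketch you can run in spark-shell or a small application. The SparkSession setup, sample rows, and column values are illustrative, not taken from a real dataset.

import org.apache.spark.sql.SparkSession

// Minimal sketch with made-up sample data
val spark = SparkSession.builder()
  .appName("FilterSyntaxExample")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._ // enables the $"colName" syntax and toDF()

val employeesDF = Seq(
  ("Alice", 34, "sales"),
  ("Bob", 28, "engineering"),
  ("Carol", 41, "sales")
).toDF("name", "age", "department")

// Employees older than 30 who work in the sales department
val filteredDF = employeesDF.filter($"age" > 30 && $"department" === "sales")
filteredDF.show() // Alice and Carol remain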
2. Examples of Date based filtering
You can use Spark SQL functions to filter a DataFrame based on date columns. Here are some examples of how to filter a DataFrame based on different date conditions:
2.1 Filter based on a specific date
import org.apache.spark.sql.functions._
val df = spark.read.format("csv").option("header", "true").load("path/to/file.csv")
val filteredDf = df.filter(to_date($"date_col", "yyyy-MM-dd") === lit("2022-01-01"))
In this example, we are filtering the DataFrame df based on the date_col column which is in the “yyyy-MM-dd” format. We use the to_date function to convert the column to a date type and compare it to a specific date using the lit function.
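If you want to try this without a CSV file, the following sketch builds equivalent data in memory; the dates are made up, and it assumes the SparkSession and imports shown earlier (import spark.implicits._ and org.apache.spark.sql.functions._) are in scope.

// Illustrative in-memory stand-in for the CSV file above
val sampleDf = Seq("2022-01-01", "2022-01-15", "2022-02-01").toDF("date_col")

// Keep only the rows whose date_col equals 2022-01-01
val jan1Rows = sampleDf.filter(to_date($"date_col", "yyyy-MM-dd") === lit("2022-01-01"))
jan1Rows.show() // only the 2022-01-01 row remains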
2.2 Filter based on a date range
val startDate = “2022-01-01”
val endDate = “2022-01-31”
val filteredDf = df.filter(to_date($"date_col", "yyyy-MM-dd").between(startDate, endDate))
Here, we are filtering the DataFrame df based on the date_col column between two dates, startDate and endDate. We use the to_date function to convert the column to a date type and use the between function to specify the date range.
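Using the same illustrative sampleDf from 2.1, the range filter looks like this; note that between() is inclusive on both ends.

// between() is inclusive of both startDate and endDate
val janRows = sampleDf.filter(to_date($"date_col", "yyyy-MM-dd").between("2022-01-01", "2022-01-31"))
janRows.show() // keeps 2022-01-01 and 2022-01-15, drops 2022-02-01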
2.3 Filter based on the current date
val filteredDf = df.filter(to_date($"date_col", "yyyy-MM-dd") === current_date())
This example filters the DataFrame df based on the date_col column, which is in the “yyyy-MM-dd” format, and compares it to the current date using the current_date function.
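A small illustration with made-up rows is shown below; it assumes the driver clock and the Spark session time zone agree, since current_date() is evaluated in the session time zone.

// One row dated today, one fixed past date; both strings are in yyyy-MM-dd form
val events = Seq(java.time.LocalDate.now().toString, "2020-06-15").toDF("date_col")

val todayRows = events.filter(to_date($"date_col", "yyyy-MM-dd") === current_date())
todayRows.show() // only the row whose date matches today's date remains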
2.4 Filter based on a date difference
val filteredDf = df.filter(datediff(to_date($"date_col", "yyyy-MM-dd"), current_date()) > 30)
In this example, we filter the DataFrame df based on the date_col column, which is in the "yyyy-MM-dd" format, using the datediff function together with current_date. Since datediff(end, start) returns end minus start in days, this keeps the rows where date_col is more than 30 days after the current date.
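Here is a sketch with made-up dates, including the reversed form for "older than 30 days".

// datediff(end, start) = end - start in days
val dates = Seq("2020-01-01", java.time.LocalDate.now().plusDays(60).toString).toDF("date_col")

// Rows whose date_col is more than 30 days AFTER the current date
val farFuture = dates.filter(datediff(to_date($"date_col", "yyyy-MM-dd"), current_date()) > 30)
farFuture.show() // only the date about 60 days ahead remains

// To keep rows more than 30 days in the PAST instead, swap the arguments:
// dates.filter(datediff(current_date(), to_date($"date_col", "yyyy-MM-dd")) > 30)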
3. Conclusion
In conclusion, filtering a Spark DataFrame on a date column in Scala can be done against a specific date, a date range, the current date, or a date difference by using the appropriate function with the desired condition.
The to_date function converts a column to a date type so that it can be compared with other dates or used with date functions.
It’s important to ensure that the date column is in a format that can be parsed by the to_date function, otherwise, the filtering may not work as expected.
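To illustrate that caveat: with Spark's default (non-ANSI) settings, to_date returns null when the string does not match the given pattern, so malformed rows can never satisfy a date filter. A small sketch with made-up values:

// "01/15/2022" does not match the yyyy-MM-dd pattern, so it parses to null
val mixed = Seq("2022-01-15", "01/15/2022").toDF("date_col")
mixed.select($"date_col", to_date($"date_col", "yyyy-MM-dd").as("parsed")).show()
// The second row's parsed value is null and will be excluded by any date comparison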