Spark RDD filter is an operation that creates a new RDD by selecting the elements from the input RDD that satisfy a given predicate (or condition). The filter operation does not modify the original RDD but creates a new RDD with the filtered elements. In this article, we shall discuss the syntax of Spark RDD Filter and different patterns to apply it.
Table of contents
1. Syntax of Spark RDD Filter
2. Spark RDD Filter Examples
2.1 Filter based on a condition using a lambda function
2.2 Filter based on multiple conditions
2.3 Filter based on the presence of a certain substring
2.4 Filter based on the length of a string
2.5 Filter based on a regular expression
3. Conclusion
1. Syntax of Spark RDD Filter
The syntax for the RDD filter in Spark using Scala is:
// Syntax of RDD filter()
val filteredRDD = inputRDD.filter(predicate)
Here, inputRDD is the RDD to be filtered and predicate is a function that takes an element from the RDD and returns a boolean value indicating whether the element satisfies the filtering condition. The filteredRDD is the resulting RDD containing only the elements that satisfy the predicate.
For example, suppose we have an RDD of integers and we want to keep only the odd numbers (that is, filter out the even ones). We can use the filter operation as follows:
// Import
import org.apache.spark.sql.SparkSession
// Create SparkSession
val spark = SparkSession.builder()
.appName("SparkRDDFilter")
.master("local[*]")
.getOrCreate()
// Create RDD
val inputRDD = spark.sparkContext
.parallelize(Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
// RDD filter() usage
val filteredRDD = inputRDD.filter(x => x % 2 != 0)
In this example, we create an input RDD with ten integers, and then we apply the filter operation with the predicate x % 2 != 0 to select only the odd numbers. The resulting filteredRDD will contain the elements 1, 3, 5, 7, 9.
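Because filter() is a transformation, it is evaluated lazily; nothing is computed until an action runs. To inspect the result, you can collect it to the driver (fine for small test data like this, though collect() should be avoided on large RDDs):
// An action triggers the computation; collect() brings the results to the driver
filteredRDD.collect().foreach(println)
// Prints 1, 3, 5, 7, 9 (one per line)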
2. Spark RDD Filter Examples
Following are some more examples of using RDD filter().
2.1 Filter based on a condition using a lambda function
First, let’s see how to filter an RDD using a lambda function.
val rdd = spark.sparkContext
.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
val filteredRDD = rdd.filter(x => x % 2 == 0)
Here, the Spark RDD filter will create an RDD containing only the even numbers (2, 4, 6, 8, 10).
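The same condition can also be factored into a named predicate, which helps when it is reused across several filters. A minimal sketch (isEven is just an illustrative helper):
// Equivalent to the lambda above, using a named predicate function
def isEven(n: Int): Boolean = n % 2 == 0
val evenRDD = rdd.filter(isEven)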
2.2 Filter based on multiple conditions
Below is an example of filtering Spark RDD on multiple conditions.
val rdd = spark.sparkContext
.parallelize(List("apple", "banana", "pear", "orange"))
val filteredRDD = rdd.filter(x => x.startsWith("a") || x.startsWith("o"))
This will create an RDD containing only the strings starting with “a” or “o” (“apple” and “orange”).
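When every condition must hold rather than any one of them, combine the conditions with && instead. A small sketch against the same data:
// Keep strings that start with "a" AND are exactly five characters long
val strictRDD = rdd.filter(x => x.startsWith("a") && x.length == 5)
// Result: "apple"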
2.3 Filter based on the presence of a certain substring
val rdd = spark.sparkContext
.parallelize(List("apple", "banana", "pear", "orange"))
val filteredRDD = rdd.filter(x => x.contains("p"))
This will create an RDD containing only the strings that contain the letter “p” (“apple” and “pear”).
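If the match should ignore case, normalize the element before calling contains(). For example, a hypothetical case-insensitive search for the substring “an”:
// Case-insensitive substring check: lower-case the element first
val containsAn = rdd.filter(x => x.toLowerCase.contains("an"))
// Result: "banana" and "orange"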
2.4 Filter based on the length of a string
val rdd = spark.sparkContext
.parallelize(List("apple", "banana", "pear", "orange"))
val filteredRDD = rdd.filter(x => x.length > 5)
This will create an RDD containing only the strings with lengths greater than 5 (“banana” and “orange”).
2.5 Filter based on a regular expression
val rdd = spark.sparkContext
.parallelize(List("apple", "banana", "pear", "orange"))
val filteredRDD = rdd.filter(x => x.matches(".*a.*"))
This will create an RDD containing only the strings that match the pattern “.*a.*”, i.e., those containing the letter “a”. Since all four strings contain an “a”, the result here is the same as the input (“apple”, “banana”, “pear”, and “orange”).
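Note that String.matches() must match the entire string, which is why the pattern is wrapped in .* on both sides. As a sketch, a more selective pattern against the same data:
// Keep only the strings whose last character is "e"
val endsWithE = rdd.filter(x => x.matches(".*e"))
// Result: "apple" and "orange"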
3. Conclusion
In conclusion, Spark RDD filter() is a transformation that creates a new RDD by selecting only the elements of an existing RDD that satisfy a given predicate. Like all transformations it is evaluated lazily, and because each partition is filtered independently of the others, it requires no shuffle and parallelizes naturally across the nodes of the cluster.
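For quick reference, here is a self-contained sketch that ties the pieces above together (the object name and sample data are arbitrary choices for illustration):
// Complete runnable example: create a session, filter an RDD, print the result
import org.apache.spark.sql.SparkSession

object RDDFilterExample extends App {
  val spark = SparkSession.builder()
    .appName("RDDFilterExample")
    .master("local[*]")
    .getOrCreate()

  val fruits = spark.sparkContext
    .parallelize(List("apple", "banana", "pear", "orange"))

  // Keep fruits longer than four characters that contain the letter "a"
  val result = fruits.filter(f => f.length > 4 && f.contains("a"))
  result.collect().foreach(println) // apple, banana, orange

  spark.stop()
}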
Related Articles
Apache Spark RDD Tutorial | Learn with Scala Examples
Spark RDD aggregate() operation example
Spark RDD fold() function example