The sparklyr filter() function is a powerful tool for filtering data rows from DataFrame based on a given condition or SQL expression. Sparklyr package allows you to use Spark capabilities in R programming, it provides an interface between R and Apache Spark. In this sparklyr article, we will explore how to apply a filter on DataFrame columns of string, arrays, and struct types using both single and multiple conditions using R language. We will also dive into the process of applying a filter using %in% operator.

By the end of this sparklyr article, you will have an in-depth understanding of how to leverage the sparklyr filter() function to extract the desired data rows from your Spark data frame using a variety of filtering techniques to refine your results. Whether you’re filtering by string, array, or struct types, or using %in% to filter based on a set of specific values, Sparklyr provides a wealth of tools and capabilities to help you effectively manipulate and analyze your data. So let’s dive in and explore the world of Sparklyr filtering!

1. Sparklyr DataFrame filter() Syntax

The code snippet below shows the syntax of the sparklyr filter() function. You will use an expression to specify the condition you wish to apply for filtering and apply it to the data frame.

# Syntax of filter()
filter(dataframe,condition)

Before we start with examples, first let’s create a DataFrame.

library(sparklyr)

data <- data.frame(
id = 1:6,
name = c(“James”, “Anna”, “Julia”, “Maria”, “Jen”, “Mike”),
age = c(30, 25, 35, 40, 45, 20),
city = c(“New York”, “London”, “Paris”, “Madrid”, “Berlin”, “Rome”)
)

# Initiate spark connection
sc <- spark_connect(master = ‘local’,
spark_home = Sys.getenv(“SPARK_HOME”),
app_name = “SparkByExamples.com”,
method = ‘shell’,
version = ‘3.0.0’)

# Load data into a Spark dataframe
df <- sdf_copy_to(sc, data, ‘df’)

# View the first 5 rows
df % head(5)

The above code yields the below output;

2. DataFrame filter() with Column Condition in sparklyr

Use Column with the condition to filter the rows from sparklyr DataFrame, using this you can also express complex conditions.

library(sparklyr)

# Using equals condition
df %>%
filter(city == “London”) %>%
View()

#cOutput:
# Source: spark [?? x 4]
# id name age city
#
#1 2 Anna 25 London

This sparklyr code applies an equals condition to filter the data frame, using the %>% operator to chain together a series of operations. Specifically, the sparklyr filter() function is used to retrieve the rows where the city column matches the value “London”. Finally, the View() function is used to display the filtered data frame.

The resulting output of the first block of code is a data frame that contains a single row. It is displaying the details of a person with id 2, named Anna, aged 25, and residing in London.

# Using not equal condition
df %>%
filter(city != “London”) %>%
show()

# Output:
# Source: spark [?? x 4]
# id name age city
#
#1 1 James 30 New York
#2 3 Julia 35 Paris
#3 4 Maria 40 Madrid
#4 5 Jen 45 Berlin
#5 6 Mike 20 Rome

# Using equal with negate in front
df %>%
filter(!(city== “London”)) %>%
show()

Here, the code applies a not equal condition to filter the data frame. Here, the sparklyr filter() function is used again to retrieve the rows where the city column does not match the value “London”. The resulting data frame is then displayed using the show() function.

Alternatively, you can also write the above statement by putting the negation at the front of the expression.

# Using equal with negate in front
df %>%
filter(!(city== “London”)) %>%
show()

This yields the same output as above example.

3. Filter with multiple conditions together

You can also filter sparklyr DataFrame with multiple conditions, just combine the multiple conditions together using the conditional operators and pass it as an expression to filter(). Here is an example.

library(sparklyr)

# Using multiple conditions
df %>%
filter(city != “London” & age > 30) %>%
show()

Output:
# Source: spark [?? x 4]
# id name age city
#
#1 3 Julia 35 Paris
#2 4 Maria 40 Madrid
#3 5 Jen 45 Berlin

This code block employs a series of operations to apply two filter conditions to the data frame: filtering for both city not equal to “London” and age greater than 30. The %>% operator chains these operations together.

To accomplish this, the code utilizes the sparklyr filter() function to filter rows in the data frame where the city column is not equal to “London” and the age column is greater than 30. The & operator combines the two conditions.

4. Filter Based on List Values

If you have a list of values in a R Vector and you would like to filter the rows based on a column value present in the list, use the sparklyr filter() with %in% operator.

library(sparklyr)

#Filter in List values
df %>%
filter(city %in% c(‘London’,’New York’)) %>%
show()

Here, c() is a combined function that is used to create a vector. The above code yields the below output:

5. Filter Based on Starts With, Ends With, Contains

library(sparklyr)

# Search for city starts with N
data %>% filter(grepl(‘N*’, city))

# Output:
# id name age city
#1 1 James 30 New York

# Search for city ends with n
data %>% filter(grepl(‘*n’, city))

# Output:
#id name age city
#1 2 Anna 25 London
#2 5 Jen 45 Berlin

The code block utilizes the grepl() function to specify a pattern that matches the city column in the data frame. Specifically, it filters the rows where the city column begins with the letter “N”, followed by zero or more characters. Then, we apply the filter() function to retrieve the resulting data frame, and we display it using the show() function.

6. Filter based on the pattern in sparklyr

Finally, let’s filter the DataFrame rows using pattern.

# Search for name which contains a
data %>% filter(grepl(‘*a’, name))

# Output:
#id name age city
#1 1 James 30 New York
#2 2 Anna 25 London
#3 3 Julia 35 Paris
#4 4 Maria 40 Madrid

7. Source code of filter operation

library(sparklyr)

# Initiate spark connection
sc <- spark_connect(master = ‘local’,
spark_home = Sys.getenv(“SPARK_HOME”),
app_name = “SparkByExamples.com”,
method = ‘shell’,
version = ‘3.0.0’)

# Load data into a Spark dataframe
df <- sdf_copy_to(sc, data, ‘df’)

# View the first 5 rows
df % head(5)

# Using equals condition
df %>%
filter(city == “London”) %>%
show()

# Using not equal condition
df %>%
filter(city != “London”) %>%
show()

# Using equal with negate in front
df %>%
filter(!(city== “London”)) %>%
show()

# Using multiple condition
df %>%
filter(city != “London” & age > 30) %>%
show()

# Filter in List values
df %>%
filter(city %in% c(‘London’,’New York’)) %>%
show()

# Search for city starts with N
data %>% filter(grepl(‘N*’, city))

# Search for city ends with n
data %>% filter(grepl(‘*n’, city))

# Search for name which contains a
data %>% filter(grepl(‘*a’, name))

Conclusion

In conclusion sparklyrr filter() is used to filter rows from the DataFrame/RDD. Sparklyr package allows you to use Spark capabilities in R programming, it provides an interface between R and Apache Spark. I hope you have learned how to apply a filter on DataFrame columns of string, arrays, and struct types using both single and multiple conditions using sparklyr in R language.

Reference

https://spark.rstudio.com/

Filter in sparklyr | R Interface to Spark kathiriya1111 Spark By {Examples}

1. Sparklyr DataFrame filter() Syntax

2. DataFrame filter() with Column Condition in sparklyr

3. Filter with multiple conditions together

4. Filter Based on List Values

5. Filter Based on Starts With, Ends With, Contains

6. Filter based on the pattern in sparklyr

7. Source code of filter operation

Conclusion

Reference

Leave a Reply Cancel reply