Skip to content

Filter in sparklyr | R Interface to Spark kathiriya1111 Spark By {Examples}

  • by

The sparklyr filter() function is a powerful tool for filtering data rows from RDD/DataFrame based on a given condition or SQL expression. Sparklyr package allows you to use Spark capabilities in R programming, it provides an interface between R and Apache Spark. In this sparklyr article, we will explore how to apply a filter on DataFrame columns of string, arrays, and struct types using both single and multiple conditions using R language. We will also dive into the process of applying a filter using %in% operator.

By the end of this sparklyr article, you will have an in-depth understanding of how to leverage the sparklyr filter() function to extract the desired data rows from your Spark data frame using a variety of filtering techniques to refine your results. Whether you’re filtering by string, array, or struct types, or using %in% to filter based on a set of specific values, Sparklyr provides a wealth of tools and capabilities to help you effectively manipulate and analyze your data. So let’s dive in and explore the world of Sparklyr filtering!

1. Sparklyr DataFrame filter() Syntax

The code snippet below shows the syntax of the sparklyr filter() function. You will use an expression to specify the condition you wish to apply for filtering and apply it to the data frame.

# Syntax of filter()
filter(dataframe,condition)

Before we start with examples, first let’s create a DataFrame.

library(sparklyr)

data <- data.frame(
id = 1:6,
name = c(“James”, “Anna”, “Julia”, “Maria”, “Jen”, “Mike”),
age = c(30, 25, 35, 40, 45, 20),
city = c(“New York”, “London”, “Paris”, “Madrid”, “Berlin”, “Rome”)
)

# Initiate spark connection
sc <- spark_connect(master = ‘local’,
spark_home = Sys.getenv(“SPARK_HOME”),
app_name = “SparkByExamples.com”,
method = ‘shell’,
version = ‘3.0.0’)

# Load data into a Spark dataframe
df <- sdf_copy_to(sc, data, ‘df’)

# View the first 5 rows
df % head(5)

The above code yields the below output;

2. DataFrame filter() with Column Condition in sparklyr

library(sparklyr)

# using equals condition
df %>%
filter(city == “London”) %>%
View()
Output:
# Source: spark [?? x 4]
id name age city

1 2 Anna 25 London

# using not equal condition
df %>%
filter(city != “London”) %>%
show()
Output:
# Source: spark [?? x 4]
id name age city

1 1 James 30 New York
2 3 Julia 35 Paris
3 4 Maria 40 Madrid
4 5 Jen 45 Berlin
5 6 Mike 20 Rome

# using equal with negate in front
df %>%
filter(!(city== “London”)) %>%
show()
# Source: spark [?? x 4]
id name age city

1 1 James 30 New York
2 3 Julia 35 Paris
3 4 Maria 40 Madrid
4 5 Jen 45 Berlin
5 6 Mike 20 Rome

The first block of code applies an equals condition to filter the data frame, using the %>% operator to chain together a series of operations. Specifically, the sparklyr filter() function is used to retrieve the rows where the city column matches the value “London”. Finally, the View() function is used to display the filtered data frame.

The resulting output of the first block of code is a data frame that contains a single row. It is displaying the details of a person with id 2, named Anna, aged 25, and residing in London.

In contrast, the second block of code applies a not-equal condition to filter the data frame. Here, the sparklyr filter() function is used again to retrieve the rows where the city column does not match the value “London”. The resulting data frame is then displayed using the show() function.

In summary, the second block of code produces a data frame that shows all the rows where the city column is not equal to “London”. By contrast, the first block of code only returns a single row where the city column is equal to “London”.

3. Filter with multiple conditions together

library(sparklyr)

# using equals condition
df %>%
filter(city != “London” & age > 30) %>%
show()
Output:
# Source: spark [?? x 4]
id name age city

1 3 Julia 35 Paris
2 4 Maria 40 Madrid
3 5 Jen 45 Berlin

This code block employs a series of operations to apply two filter conditions to the data frame: filtering for both city not equal to “London” and age greater than 30. The %>% operator chains these operations together.

To accomplish this, the code utilizes the sparklyr filter() function to filter rows in the data frame where the city column is not equal to “London” and the age column is greater than 30. The & operator combines the two conditions.

4. Filter Based on List Values

library(sparklyr)

#Filter in List values
df %>%
filter(city %in% c(‘London’,’New York’)) %>%
show()

The above code yields the below output:

5. Filter Based on Starts With, Ends With, Contains

library(sparklyr)

#Filter in List values
library(sparklyr)

# search for city starts with N
data %>% filter(grepl(‘N*’, city))
Output:
id name age city
1 1 James 30 New York

# search for city ends with n
data %>% filter(grepl(‘*n’, city))
Output:
id name age city
1 2 Anna 25 London
2 5 Jen 45 Berlin

The code block utilizes the grepl() function to specify a pattern that matches the city column in the data frame. Specifically, it filters the rows where the city column begins with the letter “N”, followed by zero or more characters. Then, we apply the sparklyr filter() function to retrieve the resulting data frame, and we display it using the show() function.

6. Filter based on the pattern in sparklyr

# Search for name which contains a
data %>% filter(grepl(‘*a’, name))

Output:
id name age city
1 1 James 30 New York
2 2 Anna 25 London
3 3 Julia 35 Paris
4 4 Maria 40 Madrid

7. Source code of filter operation

library(sparklyr)

data <- data.frame(
id = 1:6,
name = c(“James”, “Anna”, “Julia”, “Maria”, “Jen”, “Mike”),
age = c(30, 25, 35, 40, 45, 20),
city = c(“New York”, “London”, “Paris”, “Madrid”, “Berlin”, “Rome”)
)

# initiate spark connection
sc <- spark_connect(master = ‘local’,
spark_home = Sys.getenv(“SPARK_HOME”),
app_name = “SparkByExamples.com”,
method = ‘shell’,
version = ‘3.0.0’)

# Load data into a Spark dataframe
df <- sdf_copy_to(sc, data, ‘df’)

# View the first 5 rows
df % head(5)

# using equals condition
df %>%
filter(city == “London”) %>%
show()

# using not equal condition
df %>%
filter(city != “London”) %>%
show()

# using equal with negate in front
df %>%
filter(!(city== “London”)) %>%
show()

# using equals condition
df %>%
filter(city != “London” & age > 30) %>%
show()

#Filter in List values
df %>%
filter(city %in% c(‘London’,’New York’)) %>%
show()

# search for city starts with N
data %>% filter(grepl(‘N*’, city))

# search for city ends with n
data %>% filter(grepl(‘*n’, city))

# search for name which contains a
data %>% filter(grepl(‘*a’, name))

Conclusion

In conclusion SparklyrR filter() is used to rows from the DataFrame/RDD.

Reference

https://spark.rstudio.com/

 The sparklyr filter() function is a powerful tool for filtering data rows from RDD/DataFrame based on a given condition or SQL expression. Sparklyr package allows you to use Spark capabilities in R programming, it provides an interface between R and Apache Spark. In this sparklyr article, we will explore how to apply a filter on DataFrame  Read More R Programming, sparklyr 

Leave a Reply

Your email address will not be published. Required fields are marked *