Spark Rename Multiple Columns Examples

How do you rename multiple columns in a Spark DataFrame? In an Apache Spark DataFrame, a column represents a named expression that produces a value of a specific data type. You can think of a column as a logical representation of a data field in a table.

In this article, we shall discuss how to rename multiple columns, or all columns, with examples. So, let's first create a Spark DataFrame with a few columns and use this DataFrame to rename them.

// Imports
import org.apache.spark.sql.SparkSession

// Create SparkSession
val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()

// Create DataFrame
import spark.implicits._
val data = Seq((1, "John", 20), (2, "Jane", 25), (3, "Jim", 30))
val df = data.toDF("id", "name", "age")
df.show()

Yields below output.
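+---+----+---+
| id|name|age|
+---+----+---+
|  1|John| 20|
|  2|Jane| 25|
|  3| Jim| 30|
+---+----+---+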

1. Spark Rename Multiple Columns from the List

Here are a few different examples of how to rename multiple columns in Spark Scala:

val newColumnNames = Seq("new_id", "new_name", "new_age")

// Rename positionally: the column at index i gets the i-th new name
val renamedDF = newColumnNames.foldLeft(df) { (tempDF, newName) =>
  tempDF.withColumnRenamed(tempDF.columns(newColumnNames.indexOf(newName)), newName)
}

renamedDF.show()

In this example,

We define a list called newColumnNames, which contains the new column names in the order we want them to appear in the DataFrame.

We then use the foldLeft operation to iterate over the newColumnNames list and rename the columns one by one.

The withColumnRenamed function is used to rename each column in the tempDF DataFrame.

We use the columns method to get the current column names as an array, and indexOf to find the position of the new name in newColumnNames; the column at that same position is the one being renamed.

Finally, we assign the renamed DataFrame to a new variable renamedDF and display it using the show function.

The output of the code above should be:

+------+--------+-------+
|new_id|new_name|new_age|
+------+--------+-------+
|     1|    John|     20|
|     2|    Jane|     25|
|     3|     Jim|     30|
+------+--------+-------+
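Note: when you are replacing every column name positionally, as in the example above, Spark's toDF() offers a simpler one-liner that takes the full list of new names. A minimal sketch, using the df and newColumnNames values defined above:

// Equivalent positional rename using toDF()
val renamedDF2 = df.toDF(newColumnNames: _*)
renamedDF2.show()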

2. Rename Multiple Column Names from a Map

val df = Seq((1, "John", 20), (2, "Jane", 25), (3, "Jim", 30)).toDF("id", "name", "age")

// Map of old column name -> new column name
val columnsToRename = Map("id" -> "new_id", "name" -> "new_name")

val renamedDF = columnsToRename.foldLeft(df) { case (tempDF, (oldName, newName)) =>
  tempDF.withColumnRenamed(oldName, newName)
}

renamedDF.show()

In this example,

We define the DataFrame df with columns "id", "name", and "age".

We also define a Map called columnsToRename, where the keys represent the old column names and the values represent the new column names.

We then use the foldLeft operation to iterate over the columnsToRename map and rename the columns one by one.

The withColumnRenamed function is used to rename each column in the tempDF DataFrame.

Finally, we assign the renamed DataFrame to a new variable renamedDF and display it using the show function.

The output of the code above should be:

+------+--------+---+
|new_id|new_name|age|
+------+--------+---+
|     1|    John| 20|
|     2|    Jane| 25|
|     3|     Jim| 30|
+------+--------+---+
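If you prefer a single transformation over repeated withColumnRenamed() calls, the same Map can be applied in one pass with select() and alias. This is a minimal sketch, assuming the df and columnsToRename values defined above; columns absent from the Map keep their original names:

import org.apache.spark.sql.functions.col

// Alias every column in one select; fall back to the original name
val renamedDF2 = df.select(
  df.columns.map(c => col(c).as(columnsToRename.getOrElse(c, c))): _*
)
renamedDF2.show()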

3. Using a for Loop and Dynamic Column Names

// Note: df must be declared as a var because it is reassigned inside the loop
var df = Seq((1, "John", 20), (2, "Jane", 25), (3, "Jim", 30)).toDF("id", "name", "age")

val oldColumnNames = df.columns

// Prefix every existing column name with "new_"
val newColumnNames = oldColumnNames.map(name => s"new_$name")

for (i <- 0 until oldColumnNames.length) {
  df = df.withColumnRenamed(oldColumnNames(i), newColumnNames(i))
}

df.show()

In this example,

We define the DataFrame df with columns "id", "name", and "age", declared as a var this time because the loop reassigns it.

We then define an array oldColumnNames that contains the current column names of df.

We then use the map function to create a new array newColumnNames that contains the new column names, where each name is the old name with the prefix “new_” added to it.

We then use a for loop to iterate over the oldColumnNames array and rename each column using the withColumnRenamed function.

The withColumnRenamed function takes two arguments: the old column name and the new column name.

Finally, we display the renamed DataFrame using the show function.

The output of the code above should be:

+------+--------+-------+
|new_id|new_name|new_age|
+------+--------+-------+
|     1|    John|     20|
|     2|    Jane|     25|
|     3|     Jim|     30|
+------+--------+-------+
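If you want to avoid the mutable var, the same positional rename can be written with zip and foldLeft. A minimal sketch, assuming a df with the original column names and the oldColumnNames and newColumnNames arrays from above:

// Pair each old name with its new name, then fold the renames over the DataFrame
val renamedDF = oldColumnNames.zip(newColumnNames).foldLeft(df) {
  case (tempDF, (oldName, newName)) => tempDF.withColumnRenamed(oldName, newName)
}
renamedDF.show()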

4. Other Spark Column Operations

In Spark, a column refers to a logical data structure representing a named expression that produces a value for each record in a DataFrame. Columns are the building blocks for constructing DataFrame transformations and manipulations in Spark.

To work with columns in Spark Scala, you can use the org.apache.spark.sql.functions package. This package provides many built-in functions for manipulating and transforming columns in a DataFrame.

Here are some common operations you can perform on columns in Spark Scala; a combined runnable example follows the list:

Selecting Columns: To select one or more columns from a DataFrame, you can use the select function. For example, to select columns col1 and col2 from a DataFrame df, you can write df.select("col1", "col2").

Filtering Rows: To filter rows based on a condition, you can use the filter or where function. For example, to filter rows where the value in the col1 column is greater than 10, you can write df.filter(col("col1") > 10).

Adding Columns: To add a new column to a DataFrame, you can use the withColumn function. For example, to add a new column new_col that is the sum of col1 and col2, you can write df.withColumn("new_col", col("col1") + col("col2")).

Renaming Columns: To rename a column in a DataFrame, you can use the withColumnRenamed function. For example, to rename a column col1 to new_col1, you can write df.withColumnRenamed("col1", "new_col1").

Aggregating Data: To aggregate data based on one or more columns, you can use the groupBy function. For example, to group data by the col1 column and compute the sum of the col2 column for each group, you can write df.groupBy("col1").agg(sum("col2")).
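Putting these together, below is a minimal runnable sketch of all five operations. The DataFrame and its col1/col2 values are illustrative assumptions, not from a real dataset:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

val spark = SparkSession.builder().master("local[1]").appName("ColumnOps").getOrCreate()
import spark.implicits._

// Illustrative DataFrame with a repeated key in col1
val df = Seq((12, 5), (12, 20), (8, 15)).toDF("col1", "col2")

df.select("col1", "col2").show()                           // Selecting columns
df.filter(col("col1") > 10).show()                         // Filtering rows
df.withColumn("new_col", col("col1") + col("col2")).show() // Adding a column
df.withColumnRenamed("col1", "new_col1").show()            // Renaming a column
df.groupBy("col1").agg(sum("col2")).show()                 // Aggregating data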

These are just a few examples of what you can do with columns in Spark Scala. The org.apache.spark.sql.functions package provides many more functions for manipulating and transforming columns, so it’s worth exploring the documentation to learn more.

5. Conclusion

In this article, you have learned different ways of renaming multiple columns in Spark. Some approaches explicitly specify the new name for each column using the withColumnRenamed() function, while others pass the full list of new column names to the toDF() method. The right choice depends on your requirements.

Related Articles

Spark Merge Two DataFrames with Different Columns or Schema

Spark withColumnRenamed to Rename Column

Spark RDD fold() function example

Spark map() vs flatMap() with Examples

Spark Internal Execution plan