In this article, I will explain how to rearrange or change the column position in a Spark DataFrame. In Apache Spark, a DataFrame is a distributed collection of data organized into named columns. To access a specific column in a Spark DataFrame, you can use the col function or the $ operator.
Here’s an example of using the col function to access a column named “column_name” in a DataFrame named “df”:
// Import the col function
import org.apache.spark.sql.functions.col

// Select a column using col
df.select(col("column_name"))

// Alternatively, use the $ operator (this requires the implicits
// from an active SparkSession)
import spark.implicits._
df.select($"column_name")
To explore the different ways to change the column position in a Spark DataFrame, let's first create a sample DataFrame.
1. Creating Sample DataFrame
To create a DataFrame in Spark Scala with order details, you can use the createDataFrame function from the SparkSession object. Here’s an example:
// Imports
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}
// Define the schema for the orders DataFrame
val schema = StructType(Seq(
  StructField("order_id", IntegerType),
  StructField("customer_id", IntegerType),
  StructField("product_name", StringType),
  StructField("price", DoubleType),
  StructField("quantity", IntegerType)
))
// Create a list of Row objects with order details
val orders = Seq(
  Row(1, 100, "Product A", 10.0, 2),
  Row(2, 200, "Product B", 20.0, 1),
  Row(3, 100, "Product C", 30.0, 3),
  Row(4, 300, "Product D", 15.0, 2)
)
// Create a SparkSession object
val spark = SparkSession.builder().appName("Create DataFrame with Order Details").master("local[*]").getOrCreate()
// Create a DataFrame from the schema and the list of Row objects
val ordersDF = spark.createDataFrame(spark.sparkContext.parallelize(orders), schema)
// Display the orders DataFrame
ordersDF.show()
In this example, we first define the schema for the orders DataFrame, which consists of five columns: order_id, customer_id, product_name, price, and quantity. We then create a list of Row objects with the order details.
Next, we create a SparkSession object using the builder method and use the createDataFrame method to create a DataFrame from the schema and the list of Row objects. Finally, we display the orders DataFrame using the show method.
The output of the DataFrame looks like this:
// Display output of the orders DataFrame
+--------+-----------+------------+-----+--------+
|order_id|customer_id|product_name|price|quantity|
+--------+-----------+------------+-----+--------+
|       1|        100|   Product A| 10.0|       2|
|       2|        200|   Product B| 20.0|       1|
|       3|        100|   Product C| 30.0|       3|
|       4|        300|   Product D| 15.0|       2|
+--------+-----------+------------+-----+--------+
2. Different ways to change the column position of a Spark DataFrame
Changing the column position of a Spark DataFrame in Scala can be done in several ways. Let's use the orders DataFrame we created above and try out different ways to change the column positions. Here are some examples:
2.1. Using select function:
In Spark Scala, the select function is used to select one or more columns from a DataFrame, in whatever order you specify, or to select all columns.
// Imports
import org.apache.spark.sql.functions.col

// Define the new order of columns
val newOrder = Seq("product_name", "price", "quantity", "order_id", "customer_id")
// Select columns in the new order
val dfNewOrder = ordersDF.select(newOrder.map(c => col(c)): _*)
// display the orders DataFrame with re-ordered columns
dfNewOrder.show()
In this example, we define the new order of columns as a sequence of strings. Then, we use the select function on ordersDF to select the columns in the new order, passing each column name to the col function and unpacking the resulting sequence with the : _* operator. The result is a new orders DataFrame with the columns in the order given by the newOrder list.
The result after changing the column positions of the Spark DataFrame looks like this:
+------------+-----+--------+--------+-----------+
|product_name|price|quantity|order_id|customer_id|
+------------+-----+--------+--------+-----------+
|   Product A| 10.0|       2|       1|        100|
|   Product B| 20.0|       1|       2|        200|
|   Product C| 30.0|       3|       3|        100|
|   Product D| 15.0|       2|       4|        300|
+------------+-----+--------+--------+-----------+
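Listing every column by hand gets tedious for wide DataFrames. As a minimal sketch, the new order can be computed from the existing column names instead. Note that moveToFront below is a hypothetical helper of my own, not part of the Spark API; the list manipulation itself is plain Scala and works on any df.columns array.

```scala
// Hypothetical helper: move one column to the front of the column list
// without hard-coding every name.
def moveToFront(columns: Seq[String], name: String): Seq[String] =
  name +: columns.filterNot(_ == name)

val cols = Seq("order_id", "customer_id", "product_name", "price", "quantity")
val reordered = moveToFront(cols, "price")
// reordered: Seq("price", "order_id", "customer_id", "product_name", "quantity")

// In Spark this would then be applied as:
// val dfReordered = ordersDF.select(reordered.map(col): _*)
```

The same idea extends to moving a column to the end (append instead of prepend) or to sorting columns alphabetically with columns.sorted.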
2.2. Using selectExpr and withColumnRenamed functions
In Spark Scala, the selectExpr function is used to select one or more columns from a DataFrame and apply a transformation to them.
//Using selectExpr and rename column quantity into product_quantity
val dfNewOrder = ordersDF
  .selectExpr("product_name", "price", "quantity", "order_id", "customer_id")
  .withColumnRenamed("quantity", "product_quantity")
// display the orders DataFrame with re-ordered columns
dfNewOrder.show()
In this example, we use the selectExpr function to select the columns in the desired order by passing the column names as strings; it is this ordering that changes the column positions. We then use the withColumnRenamed function to rename the column "quantity" to "product_quantity".
The result after changing the column positions of the Spark DataFrame looks like this:
//Output of the dfNewOrder
+------------+-----+----------------+--------+-----------+
|product_name|price|product_quantity|order_id|customer_id|
+------------+-----+----------------+--------+-----------+
|   Product A| 10.0|               2|       1|        100|
|   Product B| 20.0|               1|       2|        200|
|   Product C| 30.0|               3|       3|        100|
|   Product D| 15.0|               2|       4|        300|
+------------+-----+----------------+--------+-----------+
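Since selectExpr accepts SQL expressions, the rename can also be done inline with AS instead of a separate withColumnRenamed call. As a sketch, buildExprs below is a hypothetical helper I use to construct the expression strings; only ordersDF.selectExpr is actual Spark API.

```scala
// Hypothetical helper: rename one column inline with SQL's AS while
// keeping the rest of the column list unchanged.
def buildExprs(columns: Seq[String], from: String, to: String): Seq[String] =
  columns.map(c => if (c == from) s"$c AS $to" else c)

val exprs = buildExprs(
  Seq("product_name", "price", "quantity", "order_id", "customer_id"),
  "quantity", "product_quantity")
// exprs(2): "quantity AS product_quantity"

// Applied in Spark as:
// val dfNewOrder = ordersDF.selectExpr(exprs: _*)
```

This avoids the extra projection that a chained withColumnRenamed introduces, since the reorder and the rename happen in one select.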
2.3. Using select and alias functions:
Another option is to combine the select function with alias, which renames a column as part of the selection.
//Using select and alias to rename column quantity into product_quantity
val dfNewOrder = ordersDF.select(
  col("product_name"), col("price"),
  col("quantity").alias("product_quantity"),
  col("order_id"), col("customer_id"))
// display the orders DataFrame with re-ordered columns
dfNewOrder.show()
In this example, we use the select function to select the columns in the desired order by passing each column to the col function, and we use the alias function to rename the column "quantity" to "product_quantity". Note that the new position comes from the argument order in select; alias only changes the name.
The result after changing the column positions of the Spark DataFrame looks like this:
//Output of the dfNewOrder
+------------+-----+----------------+--------+-----------+
|product_name|price|product_quantity|order_id|customer_id|
+------------+-----+----------------+--------+-----------+
|   Product A| 10.0|               2|       1|        100|
|   Product B| 20.0|               1|       2|        200|
|   Product C| 30.0|               3|       3|        100|
|   Product D| 15.0|               2|       4|        300|
+------------+-----+----------------+--------+-----------+
Note that these methods create a new DataFrame with the columns in the desired order, but do not modify the original DataFrame.
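For the common case of moving a single column to an arbitrary position, the column list can be computed programmatically. As a sketch, moveTo below is a hypothetical helper, not a Spark function; only the commented select call at the end is Spark API.

```scala
// Hypothetical helper: move a column to a given zero-based position
// in the column list.
def moveTo(columns: Seq[String], name: String, index: Int): Seq[String] = {
  val rest = columns.filterNot(_ == name)          // drop the column being moved
  (rest.take(index) :+ name) ++ rest.drop(index)   // re-insert it at index
}

val moved = moveTo(Seq("order_id", "customer_id", "price"), "price", 1)
// moved: Seq("order_id", "price", "customer_id")

// Then, in Spark:
// val dfMoved = ordersDF.select(moved.map(col): _*)
```

Because select returns a new DataFrame, this composes cleanly with the approaches above without ever mutating ordersDF.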
3. Conclusion
In summary, we can change the column position of a Spark Scala DataFrame using the select and selectExpr functions. To move a column to a specific position, build the list of column names in the desired order and pass it to select, unpacking it with the : _* operator. Alternatively, use the selectExpr function to select the desired columns and apply SQL expressions to them when needed. Keep in mind that each approach returns a new DataFrame and leaves the original unchanged.
Related Articles
Spark Dataframe – Show Full Column Contents?
Spark withColumnRenamed to Rename Column
Spark select() vs selectExpr() with Examples
Spark RDD vs DataFrame vs Dataset
Spark RDD Actions with examples