Remove Header from Spark DataFrame

How can I remove the header from a Spark DataFrame? We often need to remove a header or column names when writing a DataFrame to a CSV file. The header row contains the column names of the DataFrame, and it is typically the first row in the output file. Including or excluding the header row can depend on the specific use case, but in some cases, removing it makes the output more suitable for further processing or analysis.

In this article, we shall discuss how to remove headers from a Spark DataFrame, along with high-level details of the Spark DataFrame header.

1. Header in Spark DataFrame

In Spark Scala, a header in a DataFrame refers to the first row of the DataFrame that contains the column names. The header row provides descriptive labels for the data in each column and helps to make the DataFrame more readable and easier to work with.

For example, consider the following DataFrame:

// Output:
+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
|      John|      Doe| 25|
|      Jane|      Doe| 30|
|       Bob|    Smith| 45|
+----------+---------+---+

In this DataFrame, the header row contains the column names first_name, last_name, and age. These names are used to identify and access the data in each column.

Note that not all DataFrames have headers. For example, if you load a CSV file that does not contain a header row, Spark Scala will create default column names (e.g. _c0, _c1, etc.) for the DataFrame. In this case, you may need to manually add a header row to the DataFrame using the toDF method with a list of column names as an argument.
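As a sketch of that note, the snippet below reads a hypothetical headerless CSV file (the path people.csv is an assumption for illustration) and then attaches column names with toDF, replacing the default _c0, _c1, ... names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("AddHeaderExample")
  .master("local[*]")
  .getOrCreate()

// With header=false, Spark assigns default names: _c0, _c1, _c2
val raw = spark.read
  .option("header", false)
  .csv("people.csv") // hypothetical path, file has no header row

// Manually attach a header by renaming the default columns
val named = raw.toDF("first_name", "last_name", "age")
named.printSchema()
```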

2. How to remove headers while writing to a CSV file

In Spark, you can control whether or not to write the header row when writing a DataFrame to a file, such as a CSV file, by using the header option. When the header option is set to true (the default), Spark includes the header row in the output file. When the header option is set to false, Spark excludes the header row.

Here is an example code snippet:

//Imports
import org.apache.spark.sql.{DataFrame, SparkSession}

//Create Spark Session
val spark = SparkSession.builder()
  .appName("RemoveHeaderExample")
  .master("local[*]")
  .getOrCreate()

//Create a DataFrame of Name and Age
val data: Seq[(String, Int)] = Seq(("Alice", 25), ("Bob", 30), ("Charlie", 35))
val df: DataFrame = spark.createDataFrame(data).toDF("name", "age")

//Write the Spark DataFrame to a CSV file without the header
df.write
  .option("header", false) // disable writing header
  .csv("output.csv")

In this example, the DataFrame is created from a sequence of tuples and has two columns: name and age. Calling df.write returns a DataFrameWriter object, and its csv method writes the DataFrame to a CSV file. The option method is used to pass options to the CSV writer; here, we set the header option to false to disable writing the header row.

Note that when you read the CSV file back, Spark has no way of knowing the column names or types. If you rely on schema inference, Spark may guess the types from the data, and if you mistakenly set the header option to true, it will treat the first data row as a header. To avoid both problems, keep header set to false and specify the schema manually by using the schema method of the DataFrameReader object and passing the schema as a StructType object.
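A minimal sketch of reading the file back with an explicit schema, assuming the two-column name/age layout written in the example above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder()
  .appName("ReadWithSchemaExample")
  .master("local[*]")
  .getOrCreate()

// Declare the schema up front instead of relying on inference
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

val dfRead = spark.read
  .schema(schema)
  .option("header", false) // the file has no header row
  .csv("output.csv")

dfRead.printSchema()
```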

3. How to remove Headers from Spark DataFrame

There are different ways to remove headers from a Spark DataFrame, depending on the use case and the specific requirements of the task at hand.

Here are some of the ways to remove Headers from Spark DataFrame:

3.1 Using withColumnRenamed method with an empty string:

//Imports
import org.apache.spark.sql.{SparkSession, DataFrame}

//Creating Spark Session
val spark = SparkSession.builder().appName("Remove All Column Names Example").getOrCreate()

//Create DataFrame
val data = Seq(
  ("John", "Doe", 25),
  ("Jane", "Doe", 30),
  ("Bob", "Smith", 45)
)

val columns = Seq("first_name", "last_name", "age")

val df: DataFrame = spark.createDataFrame(data).toDF(columns: _*)
df.show()

//Results
+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
|      John|      Doe| 25|
|      Jane|      Doe| 30|
|       Bob|    Smith| 45|
+----------+---------+---+

val dfWithoutColumns = columns.foldLeft(df)((acc, c) => acc.withColumnRenamed(c, ""))
dfWithoutColumns.show()
//Results
+----+-----+---+
|    |     |   |
+----+-----+---+
|John|  Doe| 25|
|Jane|  Doe| 30|
| Bob|Smith| 45|
+----+-----+---+

In this example, we use the foldLeft method with the sequence of column names (i.e., the header columns), starting from the original DataFrame. At each step, the withColumnRenamed method renames one column to an empty string, so the resulting DataFrame prints with a blank header.

3.2 Using select method with empty column aliases

//Spark Imports
import org.apache.spark.sql.{SparkSession, DataFrame}

val spark = SparkSession.builder().appName("Remove All Column Names Example").getOrCreate()

val data = Seq(
  ("John", "Doe", 25),
  ("Jane", "Doe", 30),
  ("Bob", "Smith", 45)
)

val columns = Seq("first_name", "last_name", "age")

val df: DataFrame = spark.createDataFrame(data).toDF(columns: _*)
df.show()
//Results
+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
|      John|      Doe| 25|
|      Jane|      Doe| 30|
|       Bob|    Smith| 45|
+----------+---------+---+

// Remove all column names by aliasing every column to an empty string
import org.apache.spark.sql.functions.col
val dfWithoutColumns2 = df.select(df.columns.map(c => col(c).alias("")): _*)
dfWithoutColumns2.show()
//Results
+----+-----+---+
|    |     |   |
+----+-----+---+
|John|  Doe| 25|
|Jane|  Doe| 30|
| Bob|Smith| 45|
+----+-----+---+

In this example, we use the select method with a sequence of Column expressions built from df.columns. We map each column name to its Column object using the col function, alias it to an empty string, and then use the : _* notation to pass the sequence as individual arguments to the select method. The data is unchanged; only the column names are blanked out.

3.3 Using toDF method with empty column names

//Spark Imports
import org.apache.spark.sql.{SparkSession, DataFrame}

val spark = SparkSession.builder().appName("Remove All Column Names Example").getOrCreate()

val data = Seq(
  ("John", "Doe", 25),
  ("Jane", "Doe", 30),
  ("Bob", "Smith", 45)
)

val columns = Seq("first_name", "last_name", "age")

val df: DataFrame = spark.createDataFrame(data).toDF(columns: _*)
df.show()

//Results
+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
|      John|      Doe| 25|
|      Jane|      Doe| 30|
|       Bob|    Smith| 45|
+----------+---------+---+

// Remove all column names by replacing them with empty strings
val dfWithoutColumns3 = df.toDF(Seq.fill(df.columns.length)(""): _*)
dfWithoutColumns3.show()

//Results
+----+-----+---+
|    |     |   |
+----+-----+---+
|John|  Doe| 25|
|Jane|  Doe| 30|
| Bob|Smith| 45|
+----+-----+---+

In this example, we use the toDF method with a list of empty strings, one per column, generated with Seq.fill (toDF requires exactly as many names as there are columns). This creates a new DataFrame with the same data as the original, but with blank column names, so the header printed by show appears empty.

4. Conclusion

Removing all column names from a DataFrame header may not be a common operation, as it can make it difficult to work with the data. In practice, you may want to keep at least some descriptive information about the columns.

Related Articles

Spark DataFrame Tutorial with Examples

Spark Read Text File | RDD | DataFrame

Spark Filter Using contains() Examples

Spark groupByKey() vs reduceByKey()

Spark Internal Execution plan

Spark Web UI – Understanding Spark Execution
