The Spark write.option() and write.options() methods set options on the DataFrameWriter when you write a DataFrame or Dataset to a data source. They let you control how the data is persisted (for example the delimiter, header row, or compression codec) so that it can be read back for further processing or analysis. In this article, we shall discuss the different write options Spark supports along with a few examples.
1. Syntax of Spark write() Options
The syntax of the write() method is as follows:
// Example of using option()
df.write.format("csv")
  .option("header", "true")
  .option("delimiter", "|")
  .save("/path/to/output")
Here, df is the DataFrame or Dataset that you want to write. In the general form df.write.format(<format>).option(<key>, <value>).save(<path>), <format> is the format of the data source (e.g. "csv", "json", "parquet"), each <key> and <value> pair is an option for that data source (e.g. delimiter, header, compression codec), and <path> is the output path where you want to save the data.
The above example writes the contents of the DataFrame to the specified output path in CSV format with a header row and a pipe (|) delimiter.
With the option() method, you add multiple options by chaining calls. Alternatively, you can pass all of them at once to the options() method as a Map of key-value pairs.
// Example of using options()
df.write.format("csv")
  .options(Map("header" -> "true", "delimiter" -> "|"))
  .save("/path/to/output")
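One practical use of options() is to define the settings once and reuse them for both writing and reading, so the two sides cannot drift apart. The following is a minimal sketch of that idea, assuming a local SparkSession; the object name CsvOptionsSketch, the helper roundTrip, the sample rows, and the temporary output path are all hypothetical and only serve the illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object CsvOptionsSketch {
  // Hypothetical shared settings, defined once and reused for writing and reading.
  val csvOptions: Map[String, String] = Map(
    "header"    -> "true",
    "delimiter" -> "|"
  )

  // Writes df as CSV with the shared options, then loads it back with the same options.
  def roundTrip(spark: SparkSession, df: DataFrame, path: String): DataFrame = {
    df.write.format("csv").options(csvOptions).save(path)
    spark.read.format("csv").options(csvOptions).load(path)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-options-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Illustrative sample data, not taken from the article.
    val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

    // A fresh sub-path of a temp directory, so the default save mode does not fail.
    val path = Files.createTempDirectory("spark-write-").resolve("people").toString

    roundTrip(spark, df, path).show()
    spark.stop()
  }
}
```

Because the reader receives the same header and delimiter settings as the writer, the column names survive the round trip instead of being read as a data row.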
2. Available Spark Write() Options
Spark provides several settings for writing data to different storage systems. The first three below (mode, format, and partitionBy) are methods on the DataFrameWriter rather than key-value options; the rest are passed through option() or options(). Some of the most common are:
mode: Specifies what to do if the output data already exists. The default is errorifexists (alias error), which fails the write; you can also set it to overwrite, append, or ignore.
format: Specifies the file format to be used while writing the data. Spark has built-in support for Parquet, ORC, JSON, CSV, and text; Avro is available through the separate spark-avro module.
partitionBy: Partitions the output data by one or more columns, writing each distinct value to its own subdirectory. This can be helpful in optimizing queries that read only a subset of the data.
compression: Specifies the compression codec to be used while writing the output data. The supported codecs depend on the format; for example, gzip and bzip2 for CSV and JSON, and snappy and gzip for Parquet.
header: This option is used to specify whether to include the header row in the output file, for formats such as CSV.
nullValue: This option is used to specify the string representation of null values in the output file.
escape: This option is used to specify the character that escapes quote characters inside an already quoted value when writing data in formats like CSV. The default is a backslash (\).
quote: This option is used to specify the quote character to use when writing data in formats like CSV.
dateFormat: This option is used to specify the date format to be used while writing date data.
timestampFormat: This option is used to specify the timestamp format to be used while writing timestamp data.
These are some of the common write options in Spark, but there are many others depending on the storage system and file format you are using.
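The effect of the different save modes is easiest to see by writing to the same path several times. The sketch below assumes a local SparkSession and uses the SaveMode enum from org.apache.spark.sql, which is the typed equivalent of the mode strings listed above; the object name SaveModeSketch, the helper rowCountsPerMode, and the sample rows are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object SaveModeSketch {
  // Writes df to the same path under different save modes and returns the
  // row count found at that path after each step.
  def rowCountsPerMode(spark: SparkSession, df: DataFrame, path: String): Seq[Long] = {
    def rowsOnDisk: Long = spark.read.parquet(path).count()

    // Default mode (errorifexists): succeeds because the path does not exist yet.
    df.write.format("parquet").save(path)
    val afterFirstWrite = rowsOnDisk

    // Append: adds the same rows a second time.
    df.write.mode(SaveMode.Append).format("parquet").save(path)
    val afterAppend = rowsOnDisk

    // Ignore: data already exists, so this write is skipped without an error.
    df.write.mode(SaveMode.Ignore).format("parquet").save(path)
    val afterIgnore = rowsOnDisk

    // Overwrite: replaces whatever is at the path with the contents of df.
    df.write.mode(SaveMode.Overwrite).format("parquet").save(path)
    val afterOverwrite = rowsOnDisk

    Seq(afterFirstWrite, afterAppend, afterIgnore, afterOverwrite)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("save-mode-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Two illustrative rows, not taken from the article.
    val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

    // A sub-path that does not exist yet, so the first default-mode write succeeds.
    val path = Files.createTempDirectory("spark-modes-").resolve("people").toString

    println(rowCountsPerMode(spark, df, path).mkString(", ")) // prints 2, 4, 4, 2
    spark.stop()
  }
}
```

Running a further default-mode write against the same path would throw an exception, since the data now exists; that is the behavior errorifexists is named after.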
3. Examples of Spark Write()
Here are some examples of using Spark write options in Scala:
1. Setting the output mode to overwrite
df.write.mode("overwrite").csv("/path/to/output")
2. Writing data in Parquet format
df.write.format("parquet").save("/path/to/output")
3. Partitioning the output data by a specific column
df.write.partitionBy("date").csv("/path/to/output")
4. Compressing the output data using gzip
df.write.option("compression", "gzip").csv("/path/to/output")
5. Including the header row in the CSV output file
df.write.option("header", "true").csv("/path/to/output")
6. Specifying the null value string
df.write.option("nullValue", "NA").csv("/path/to/output")
7. Escaping special characters in the output file
df.write.option("escape", "\"").csv("/path/to/output")
8. Specifying the quote character in the output file
df.write.option("quote", "'").csv("/path/to/output")
9. Setting the date format while writing date data
df.write.option("dateFormat", "yyyy-MM-dd").csv("/path/to/output")
10. Setting the timestamp format while writing timestamp data
df.write.option("timestampFormat", "yyyy-MM-dd HH:mm:ss").csv("/path/to/output")
These are just a few examples of Spark write options in Scala. There are many more options available depending on the storage system and file format you are using.
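In practice these settings are usually combined in a single write. The sketch below puts several of the examples above together and then lists the partition directories Spark creates, assuming a local SparkSession; the object name CombinedWriteSketch, the helper writePartitionedCsv, the column names, and the sample rows are hypothetical.

```scala
import java.io.File
import java.nio.file.Files
import java.sql.Date
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object CombinedWriteSketch {
  // Writes df as gzip-compressed CSV partitioned by "country" and returns the
  // names of the partition directories created under the output path.
  def writePartitionedCsv(df: DataFrame, path: String): Seq[String] = {
    df.write
      .mode(SaveMode.Overwrite)
      .partitionBy("country")
      .option("header", "true")
      .option("nullValue", "NA")
      .option("dateFormat", "dd/MM/yyyy")
      .option("compression", "gzip")
      .csv(path)

    new File(path).listFiles().filter(_.isDirectory).map(_.getName).sorted.toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("combined-write-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Illustrative rows; the second one has a null name to exercise nullValue.
    val df = Seq(
      (1, Option("alice"), "US", Date.valueOf("2024-01-01")),
      (2, Option.empty[String], "DE", Date.valueOf("2024-01-02"))
    ).toDF("id", "name", "country", "signup_date")

    val path = Files.createTempDirectory("spark-combined-").resolve("users").toString

    // One subdirectory per distinct value of the partition column.
    println(writePartitionedCsv(df, path).mkString(", ")) // prints country=DE, country=US

    // Reading back with matching options; "country" is recovered from the directory names.
    spark.read
      .option("header", "true")
      .option("nullValue", "NA")
      .option("dateFormat", "dd/MM/yyyy")
      .csv(path)
      .show()

    spark.stop()
  }
}
```

Note that the partition column is stored in the directory names rather than inside the CSV files, which is why a reader that filters on country can skip whole directories.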
4. Conclusion
In conclusion, Spark provides a wide range of write options that can be used to customize the output data according to specific requirements. These options can be used to control the output mode, format, partitioning, compression, header, null value representation, escape and quote characters, date and timestamp formats, and more.