pyspark - Write each row of a spark dataframe as a separate file

Question

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I have a Spark DataFrame with a single column, where each row is a long string (actually an XML file). I want to go through the DataFrame and save the string from each row as a separate text file; the files can simply be named 1.xml, 2.xml, and so on.

I cannot find any information or examples on how to do this, and I am just starting to work with Spark and PySpark. Maybe I could map a function over the DataFrame, but the function would have to write a string to a text file, and I can't find how to do that.

1 Reply

深蓝 · Answer 1 · 2021-10-23T19:39:48+0000

When Spark saves a DataFrame, it writes one file per partition. Hence, one way to get a single row per file is to first repartition the data into as many partitions as there are rows.

There is a library on GitHub for reading and writing XML files with Spark. However, the DataFrame needs to have a special format for it to produce correct XML. In this case, since everything is already a string in a single column, the easiest way to save it would probably be as CSV.

The repartition and saving can be done as follows:

rows = df.count()
df.repartition(rows).write.csv('save-dir')
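The repartition approach leaves Spark in charge of file names (it emits part-* files), so it does not produce 1.xml, 2.xml, and so on. If those exact names matter and the files can be written from the driver to its local filesystem, a plain Python loop is one alternative. The sketch below is only an illustration under those assumptions: `write_rows_as_files` is a hypothetical helper name, and it accepts any iterable of tuple-like rows, such as the iterator returned by `df.toLocalIterator()`.

```python
import os


def write_rows_as_files(rows, out_dir, start=1):
    """Write the first field of each row to its own file: 1.xml, 2.xml, ...

    `rows` is any iterable of tuple-like rows, for example the iterator
    returned by `df.toLocalIterator()` on a PySpark DataFrame.
    Returns the number of files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    for index, row in enumerate(rows, start=start):
        # File name follows the numbering asked for in the question.
        path = os.path.join(out_dir, f"{index}.xml")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(row[0])  # single-column row: the XML string
        count += 1
    return count


# With Spark (assumes `df` is the single-column DataFrame from the question):
#     write_rows_as_files(df.toLocalIterator(), "save-dir")
```

Because every row passes through the driver, this is likely to be slower than a distributed write on large DataFrames. It trades throughput for full control over file names and contents.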
