I'm porting a streaming job (Kafka topic -> Parquet files on AWS S3) from Kafka Connect to a Spark Structured Streaming job.
I partition my data by year/month/day.
The code is very simple:
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.functions;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.Trigger;

// Derive the partition columns from the event timestamp.
df.withColumn("year", functions.date_format(col("createdAt"), "yyyy"))
  .withColumn("month", functions.date_format(col("createdAt"), "MM"))
  .withColumn("day", functions.date_format(col("createdAt"), "dd"))
  .writeStream()
  .trigger(Trigger.ProcessingTime("15 seconds"))
  .outputMode(OutputMode.Append())
  .format("parquet")
  .option("checkpointLocation", "/some/checkpoint/directory/")
  .option("path", "/some/directory/")
  .option("truncate", "false") // console-sink option; the parquet sink ignores it
  .partitionBy("year", "month", "day")
  .start()
  .awaitTermination();
The output files are in the following directory (as expected):
/s3-bucket/some/directory/year=2021/month=01/day=02/
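For context, the key=value layout is Hive-style partitioning: Spark's reader recognizes it and reconstructs year, month and day as columns. A minimal read-back sketch, assuming a SparkSession named spark:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;

// Partition discovery turns .../year=2021/month=01/day=02/ back into columns.
Dataset<Row> history = spark.read().parquet("/some/directory/");
history.filter(col("year").equalTo("2021")).show();

Renaming the directories to plain 2021/01/02 would lose this automatic partition discovery, so consumers would have to derive those columns themselves.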
Question:
Is there a way to customize the output directory names? For backward-compatibility reasons, I need them to be
/s3-bucket/some/directory/2021/01/02/
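One workaround I've been looking at is foreachBatch, which hands each micro-batch to ordinary batch code, so the target path can be built by hand. A minimal sketch, not tested at scale: foreachBatch is a real Structured Streaming API, but the date-splitting logic below is my own, and the per-date writes are separate batch commits rather than one coordinated streaming commit:

import java.util.List;
import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.streaming.Trigger;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.date_format;

df.writeStream()
  .trigger(Trigger.ProcessingTime("15 seconds"))
  .option("checkpointLocation", "/some/checkpoint/directory/")
  .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (batch, batchId) -> {
      // Find which dates this micro-batch touches.
      List<Row> dates = batch
          .select(date_format(col("createdAt"), "yyyy/MM/dd").as("dt"))
          .distinct()
          .collectAsList();
      // Write each date's slice under a hand-built path: .../2021/01/02/
      for (Row r : dates) {
          String dt = r.getString(0);
          batch.filter(date_format(col("createdAt"), "yyyy/MM/dd").equalTo(dt))
               .write()
               .mode(SaveMode.Append)
               .parquet("/some/directory/" + dt + "/");
      }
  })
  .start()
  .awaitTermination();

The trade-off is that a failure mid-loop can leave duplicate files (at-least-once instead of the parquet sink's exactly-once), and calling batch.persist() before the loop would avoid recomputing the batch once per date.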
question from:
https://stackoverflow.com/questions/65617240/spark-structured-streaming-custom-partition-directory-name