partitioning - How to spark partitionBy/bucketBy correctly?

Question

posted Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)

Q1. Will an ad hoc (dynamic) repartition of the data on the line before a join help to avoid shuffling, or will the shuffle happen anyway at the repartition, with no way to escape it?

Q2. Should I use repartition, partitionBy, or bucketBy? What is the right approach if I will join on the columns day and user_id in the future? (I am saving the results as Hive tables with .write.saveAsTable.) My guess is to partition by day and bucket by user_id, but that seems to create thousands of files (see Why is Spark saveAsTable with bucketBy creating thousands of files?).

question from:https://stackoverflow.com/questions/65929246/how-to-spark-partitionby-bucketby-correctly

1 Reply

深蓝 · Answer 1 · 2021-10-06T19:01:43+0000

Some 'guidance' off the top of my head, noting that title and body of text differ to a degree:

Question 1:

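A JOIN will automatically do any (hash) partitioning / repartitioning it requires, unless it runs as a Broadcast JOIN. You may set the number of shuffle partitions (spark.sql.shuffle.partitions) or use the default of 200. Bear in mind that a JOIN involves more than one party (DF), so the partitioning of every DF in the JOIN matters, not just the one you repartitioned.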

repartition is a transformation, so an up-front repartition may not be executed as written, or at all, once Catalyst has optimized the query. Check the physical plan generated by .explain(). That's the deal with lazy evaluation: what is actually necessary is determined only when an Action is invoked.
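A minimal PySpark sketch of the two points above, assuming two DataFrames are already loaded. The helper names (`configure_shuffle`, `join_and_explain`) are hypothetical and not part of the original answer. It sets the shuffle partition count and prints the physical plan, so you can check for yourself whether an Exchange (shuffle) step appears and which join strategy was picked.

```python
from typing import TYPE_CHECKING, Sequence

if TYPE_CHECKING:  # pyspark is imported for type hints only
    from pyspark.sql import DataFrame, SparkSession


def configure_shuffle(spark: "SparkSession", num_partitions: int = 200) -> None:
    """Set how many partitions a shuffling join or aggregation produces."""
    spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))


def join_and_explain(left: "DataFrame", right: "DataFrame",
                     keys: Sequence[str]) -> "DataFrame":
    """Join on `keys` and print the physical plan Spark intends to run."""
    joined = left.join(right, on=list(keys), how="inner")
    # Nothing has executed yet (lazy evaluation); explain() only prints the plan.
    # Look for Exchange nodes (shuffles) and for BroadcastHashJoin vs SortMergeJoin.
    joined.explain()
    return joined
```

With a live session this would be called as `configure_shuffle(spark, 64)` followed by `join_and_explain(df_a, df_b, ["day", "user_id"])`, where `df_a` and `df_b` are placeholders for your own DataFrames. Comparing the plan with and without an explicit repartition before the join is a quick way to see whether the repartition changed anything.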

Question 2:

If you have a use case where you regularly JOIN the same inputs / outputs, then using Spark's bucketBy is a good approach. When the tables are bucketed on the JOIN key, it obviates shuffling at JOIN time. The Databricks docs show this clearly.
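As a rough sketch of that approach in PySpark: the table names, the bucket count of 16, and the helper names below are all assumptions made for illustration. The idea is to write both sides of the recurring JOIN with the same bucket column and the same number of buckets, then read them back as tables and join.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # pyspark is imported for type hints only
    from pyspark.sql import DataFrame, SparkSession


def write_bucketed(df: "DataFrame", table: str, num_buckets: int = 16) -> None:
    """Save `df` as a table partitioned by day and bucketed by user_id."""
    (
        df.write
        .partitionBy("day")                # one directory per day value
        .bucketBy(num_buckets, "user_id")  # fixed number of hash buckets
        .sortBy("user_id")                 # keep rows sorted within each bucket
        .mode("overwrite")
        .saveAsTable(table)                # bucketing is recorded in the table metadata
    )


def join_bucketed(spark: "SparkSession", left_table: str,
                  right_table: str) -> "DataFrame":
    """Read two bucketed tables and join them on the partition and bucket keys."""
    left = spark.table(left_table)
    right = spark.table(right_table)
    return left.join(right, on=["day", "user_id"], how="inner")
```

Calling `.explain()` on the result of `join_bucketed` is the way to confirm whether the shuffle was actually avoided for your tables and Spark version.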

A Spark schema using bucketBy is NOT compatible with Hive, so these remain Spark-only tables, unless this has changed recently.

Using Hive partitioning as you describe depends on push-down logic, partition pruning, etc. It should work as well, but you may end up with a different number of partitions inside the Spark framework after the read. It's a bit more complicated than saying "I have N partitions, so I will get N partitions on the initial read."
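On the "thousands of files" concern raised in Q2, here is a back-of-the-envelope sketch. It rests on an assumption that the answer above does not state: each write task emits one file per (partition value, bucket) pair it holds rows for. The function names and the sample numbers are illustrative only.

```python
def worst_case_files(write_tasks: int, partition_values: int, num_buckets: int) -> int:
    """Upper bound when every task holds rows for every (day, bucket) pair."""
    return write_tasks * partition_values * num_buckets


def files_if_one_bucket_per_task(partition_values: int, num_buckets: int) -> int:
    """Upper bound when each task only holds rows for a single bucket."""
    return partition_values * num_buckets


# Hypothetical job: 200 write tasks, 30 distinct days, 16 buckets.
print(worst_case_files(200, 30, 16))         # 96000
print(files_if_one_bucket_per_task(30, 16))  # 480
```

One commonly suggested way to move towards the lower figure is to repartition on the bucket column before writing, for example `df.repartition(16, "user_id")` ahead of the bucketed write. Whether rows really end up one bucket per task is worth verifying by counting the output files on your own data.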
