Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
807 views
in Technique[技术] by (71.8m points)

sorting - How does Spark achieve sort order?

Assume I have a list of Strings. I filter & sort them, and collect the result to driver. However, things are distributed, and each RDD has it's own part of original list. So, how does Spark achieve the final sorted order, does it merge results?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Sorting in Spark is a multiphase process which requires shuffling:

  1. input RDD is sampled and this sample is used to compute boundaries for each output partition (sample followed by collect)
  2. input RDD is partitioned using rangePartitioner with boundaries computed in the first step (partitionBy)
  3. each partition from the second step is sorted locally (mapPartitions)

When the data is collected, all that is left is to follow the order defined by the partitioner.

Above steps are clearly reflected in a debug string:

scala> val rdd = sc.parallelize(Seq(4, 2, 5, 3, 1))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at ...

scala> rdd.sortBy(identity).toDebugString
res1: String = 
(6) MapPartitionsRDD[10] at sortBy at <console>:24 [] // Sort partitions
 |  ShuffledRDD[9] at sortBy at <console>:24 [] // Shuffle
 +-(8) MapPartitionsRDD[6] at sortBy at <console>:24 [] // Pre-shuffle steps
    |  ParallelCollectionRDD[0] at parallelize at <console>:21 [] // Parallelize

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...