The output is saved as multiple files because the computation is distributed. If the output is small enough that you think it will fit on one machine, you can end your program with
val arr = year.collect()
and then save the resulting array as a file. Another way would be to use a custom partitioner with partitionBy and make it so everything goes to one partition, though that isn't advisable because you won't get any parallelization.
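The collect-and-save route can be sketched roughly as follows. This is only a minimal illustration: saveLocally and the file name year.txt are hypothetical names made up for the example, and it assumes the collected elements can be written one per line.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Hypothetical helper (not part of Spark): writes each element on its own
// line of a single local file on the driver machine.
def saveLocally(lines: Array[String], path: String): Unit = {
  val content = lines.mkString("\n")
  Files.write(Paths.get(path), content.getBytes(StandardCharsets.UTF_8))
}

// With Spark, assuming `year` is the RDD from the question:
//   val arr = year.collect().map(_.toString)
//   saveLocally(arr, "year.txt")
```

Keep in mind that collect() pulls the whole dataset into the driver's memory, so this only makes sense when the result really is small.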
If you require the file to be saved with saveAsTextFile, you can use coalesce(1,true).saveAsTextFile(). This basically means: do the computation, then coalesce to 1 partition. You can also use repartition(1), which is just a wrapper for coalesce with the shuffle argument set to true. Looking through the source of RDD.scala is how I figured most of this stuff out; you should take a look.
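As a rough sketch of the coalesce approach, something like the following should work. The helper name saveAsSinglePart and the paths in the usage comments are hypothetical, and the RDD is assumed to hold strings.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper: shuffle everything into one partition after the
// upstream computation has run in parallel, then write that partition out.
def saveAsSinglePart(rdd: RDD[String], outputDir: String): Unit =
  rdd.coalesce(1, shuffle = true).saveAsTextFile(outputDir)

// Equivalent, since repartition(n) calls coalesce(n, shuffle = true):
//   rdd.repartition(1).saveAsTextFile(outputDir)

// Usage, assuming a local SparkContext and a not-yet-existing output path:
//   val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[*]"))
//   saveAsSinglePart(sc.parallelize(Seq("1999", "2000", "2001"), 3), "year-output")
//   sc.stop()
```

Note that saveAsTextFile still creates a directory at the given path; with one partition it should contain a single part-00000 file rather than one file per partition.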