python 3.x - How to convert a pyspark.rdd.PipelinedRDD to a DataFrame without using collect() in PySpark?

Question

posted Oct 24, 2021 in Technique [Technology] by 深蓝 (71.8m points)

I have a pyspark.rdd.PipelinedRDD (Rdd1). When I call Rdd1.collect(), it returns a result like this:

 [(10, {3: 3.616726727464709, 4: 2.9996439803387602, 5: 1.6767412921625855}),
 (1, {3: 2.016527311459324, 4: -1.5271512313750577, 5: 1.9665475696370045}),
 (2, {3: 6.230272144805092, 4: 4.033642544526678, 5: 3.1517805604906313}),
 (3, {3: -0.3924680103722977, 4: 2.9757316477407443, 5: -1.5689126834176417})]

Now I want to convert this pyspark.rdd.PipelinedRDD to a DataFrame without using the collect() method.

The final DataFrame should look like this when displayed with df.show():

+----------+-------+-------------------+
|CId       |IID    |Score              |
+----------+-------+-------------------+
|10        |4      |2.9996439803387602 |
|10        |5      |1.6767412921625855 |
|10        |3      |3.616726727464709  |
|1         |4      |-1.5271512313750577|
|1         |5      |1.9665475696370045 |
|1         |3      |2.016527311459324  |
|2         |4      |4.033642544526678  |
|2         |5      |3.1517805604906313 |
|2         |3      |6.230272144805092  |
|3         |4      |2.9757316477407443 |
|3         |5      |-1.5689126834176417|
|3         |3      |-0.3924680103722977|
+----------+-------+-------------------+

I can already achieve this by calling collect() on the RDD, iterating over the result, and building a DataFrame from it.

However, I now want to convert the pyspark.rdd.PipelinedRDD (Rdd1) to a DataFrame without calling collect() at all.

How can I achieve this?

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:53:47+0000

You want to do two things here: (1) flatten your data, and (2) put it into a DataFrame.

One way to do it is as follows:

First, let us flatten the dictionary:

# Emit one (key, value) tuple per dictionary entry; the outer key (CId) is kept.
rdd2 = Rdd1.flatMapValues(lambda x: list(x.items()))

If you collect rdd2 (only to inspect it; the conversion itself does not need this), you get something like this:

[(10, (3, 3.616726727464709)), (10, (4, 2.9996439803387602)), ...
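
To make the flattening step easier to follow, here is a minimal plain-Python sketch of what flatMapValues does to each (key, value) pair. The helper flat_map_values is a hypothetical stand-in written only for illustration; it is not part of PySpark, and it runs on an ordinary list rather than a distributed RDD.

```python
def flat_map_values(pairs, f):
    """Plain-Python model of RDD.flatMapValues (illustration only).

    For every (key, value) pair, f(value) yields several items and the
    key is repeated once for each of them.
    """
    return [(key, item) for key, value in pairs for item in f(value)]


# Two rows shaped like the data in the question (values shortened to two rows).
sample = [
    (10, {3: 3.616726727464709, 4: 2.9996439803387602}),
    (1, {3: 2.016527311459324}),
]

# Same lambda idea as in the Spark call: one (IID, Score) tuple per dict entry.
flattened = flat_map_values(sample, lambda d: list(d.items()))
print(flattened)
```

On a real RDD the same transformation is evaluated lazily on the executors, so nothing is pulled back to the driver at this point.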

Then we can format the data and turn it into a dataframe:

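# Unpack (CId, (IID, Score)) into flat 3-tuples, then name the columns.
# The outer parentheses let the method chain span several lines.
(rdd2.map(lambda x: (x[0], x[1][0], x[1][1]))
     .toDF(("CId", "IID", "Score"))
     .show())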

which gives you this:

+---+---+-------------------+
|CId|IID|              Score|
+---+---+-------------------+
| 10|  3|  3.616726727464709|
| 10|  4| 2.9996439803387602|
| 10|  5| 1.6767412921625855|
|  1|  3|  2.016527311459324|
|  1|  4|-1.5271512313750577|
|  1|  5| 1.9665475696370045|
|  2|  3|  6.230272144805092|
|  2|  4|  4.033642544526678|
|  2|  5| 3.1517805604906313|
|  3|  3|-0.3924680103722977|
|  3|  4| 2.9757316477407443|
|  3|  5|-1.5689126834176417|
+---+---+-------------------+
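
As a possible variation, the two steps can be folded into a single flatMap that emits flat 3-tuples directly. This is only a sketch: the names explode_scores and rdd_to_dataframe are hypothetical helpers introduced here, and rdd_to_dataframe assumes a SparkSession has already been created, since that is what makes toDF available on RDDs.

```python
def explode_scores(pair):
    """Turn one (CId, {IID: Score, ...}) pair into a list of (CId, IID, Score) rows."""
    cid, scores = pair
    return [(cid, iid, score) for iid, score in scores.items()]


def rdd_to_dataframe(rdd):
    """Build the DataFrame without collect().

    Assumes `rdd` is a PySpark RDD of (CId, dict) pairs and that a
    SparkSession is active. flatMap and toDF are both lazy, so no data
    is brought back to the driver here.
    """
    return rdd.flatMap(explode_scores).toDF(["CId", "IID", "Score"])


# Local check of the row-level logic, no Spark needed:
rows = explode_scores((10, {3: 3.616726727464709, 4: 2.9996439803387602}))
print(rows)
```

With the RDD from the question, the call would be rdd_to_dataframe(Rdd1).show(). The row order in the output may differ from the table above, since Spark does not guarantee ordering without an explicit sort.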
