Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
225 views
in Technique[技术] by (71.8m points)

python - Dataframe transpose with pyspark in Apache Spark

I have a dataframe df that have following structure:

+-----+-----+-----+-------+
|  s  |col_1|col_2|col_...|
+-----+-----+-----+-------+
| f1  |  0.0|  0.6|  ...  |
| f2  |  0.6|  0.7|  ...  |
| f3  |  0.5|  0.9|  ...  |
|  ...|  ...|  ...|  ...  |

And I want to calculate the transpose of this dataframe so it will be look like

+-------+-----+-----+-------+------+
|  s    | f1  | f2  | f3    |   ...|
+-------+-----+-----+-------+------+
|col_1  |  0.0|  0.6|  0.5  |   ...|
|col_2  |  0.6|  0.7|  0.9  |   ...|
|col_...|  ...|  ...|  ...  |   ...|

I tied this two solutions but it returns that dataframe has not the specified used method:

method 1:

 for x in df.columns:
    df = df.pivot(x)

method 2:

df = sc.parallelize([ (k,) + tuple(v[0:]) for k,v in df.items()]).toDF()

how can I fix this.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

If data is small enough to be transposed (not pivoted with aggregation) you can just convert it to Pandas DataFrame:

df = sc.parallelize([
    ("f1", 0.0, 0.6, 0.5),
    ("f2", 0.6, 0.7, 0.9)]).toDF(["s", "col_1", "col_2", "col_3"])

df.toPandas().set_index("s").transpose()
s       f1   f2
col_1  0.0  0.6
col_2  0.6  0.7
col_3  0.5  0.9

If it is to large for this, Spark won't help. Spark DataFrame distributes data by row (although locally uses columnar storage), therefore size of a individual rows is limited to local memory.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...