Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


0 votes
931 views
in Technique by (71.8m points)

pyspark - How to perform group by and aggregate operation on spark

I have a Dataset like the one below:

+----------------------------------+--------------------------------------------------------------------+----------+
|word                              |features                                                            |prediction|
+----------------------------------+--------------------------------------------------------------------+----------+
|simple sentence                   |(2000,[1092,1980],[0.0,0.5753641449035617])                         |1         |
|simple important sentence         |(2000,[537,1092,1980],[0.28768207245178085,0.0,0.28768207245178085])|0         |
|important sentence                |(2000,[537,1092],[0.5753641449035617,0.0])                          |0         |
+----------------------------------+--------------------------------------------------------------------+----------+

Here I have two clusters (0 and 1). I want to select the words with the highest weight in each cluster (at least the top 2 words per cluster).

For example (for convenience, I have listed the weight next to each word):

 prediction
 1          sentence(1092)(0.0)                   simple(1980)(0.5753641449035617)                     
 0          important(537)(0.28768207245178085)   sentence(1092)(0.0)    simple(1980)(0.28768207245178085)
 0          important(537)(0.5753641449035617)    sentence(1092)(0.0) 

So, based on the dataset above, the highest weights among the words of cluster 1 are

"simple" (0.5753641449035617) and "sentence" (0.0)

and the highest weights in cluster 0 are

"important" (0.5753641449035617) and "simple" (0.28768207245178085)

Based on the above, I expect the output to look like the following:

+----------+-----------------------------------------------+-------------------+-----------------------------------------+
|prediction|docname                                        |top_terms          |weight                                   |
+----------+-----------------------------------------------+-------------------+-----------------------------------------+
|1         |[simple sentence]                              |[simple, sentence] |[0.5753641449035617, 0.0]                |
|0         |[simple important sentence, important sentence]|[important, simple]|[0.5753641449035617, 0.28768207245178085]|
+----------+-----------------------------------------------+-------------------+-----------------------------------------+

Please help me figure out how to resolve this.

Thanks

question from:https://stackoverflow.com/questions/65915468/how-to-perform-group-by-and-aggregate-operation-on-spark


1 Reply

0 votes
by (71.8m points)
Waiting for answers
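One possible direction (a sketch, not a verified answer): the per-cluster top-term selection can first be prototyped in plain Python on the question's sample data. The `vocab` mapping from vector index to term is an assumption here; in Spark it would typically come from something like `CountVectorizerModel.vocabulary`. In PySpark the same shape could then be achieved by expanding each SparseVector's (index, value) pairs into rows, then `groupBy("prediction")` with `collect_list` or a window function to keep the top-k weights; the exact calls depend on your Spark version.

```python
from collections import defaultdict

# Sample rows mirroring the question's dataset:
# (document, vocabulary indices, tf-idf weights, cluster prediction)
rows = [
    ("simple sentence", [1092, 1980], [0.0, 0.5753641449035617], 1),
    ("simple important sentence", [537, 1092, 1980],
     [0.28768207245178085, 0.0, 0.28768207245178085], 0),
    ("important sentence", [537, 1092], [0.5753641449035617, 0.0], 0),
]

# Hypothetical index -> term mapping; in Spark this would come from the
# fitted vectorizer's vocabulary, not be hard-coded like this.
vocab = {537: "important", 1092: "sentence", 1980: "simple"}

def top_terms_per_cluster(rows, vocab, k=2):
    """Group (term, weight) pairs by cluster, keep each term's maximum
    weight, and return the k heaviest terms per cluster."""
    best = defaultdict(dict)   # cluster -> {term: max weight seen}
    docs = defaultdict(list)   # cluster -> document names
    for doc, idxs, weights, cluster in rows:
        docs[cluster].append(doc)
        for i, w in zip(idxs, weights):
            term = vocab[i]
            best[cluster][term] = max(best[cluster].get(term, float("-inf")), w)
    out = {}
    for cluster, term_weights in best.items():
        top = sorted(term_weights.items(), key=lambda t: t[1], reverse=True)[:k]
        out[cluster] = {
            "docname": docs[cluster],
            "top_terms": [t for t, _ in top],
            "weight": [w for _, w in top],
        }
    return out

result = top_terms_per_cluster(rows, vocab)
# result[1]["top_terms"] -> ["simple", "sentence"]
# result[0]["top_terms"] -> ["important", "simple"]
```

This reproduces the expected output above: cluster 1 yields ["simple", "sentence"] and cluster 0 yields ["important", "simple"], each with the corresponding maximum weights.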

