Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


0 votes
931 views
in Technique by (71.8m points)

pyspark - How to perform group by and aggregate operation on spark

I have a Dataset like the one below:

+----------------------------------+--------------------------------------------------------------------+----------+
|word                              |features                                                            |prediction|
+----------------------------------+--------------------------------------------------------------------+----------+
|simple sentence                   |(2000,[1092,1980],[0.0,0.5753641449035617])                         |1         |
|simple important sentence         |(2000,[537,1092,1980],[0.28768207245178085,0.0,0.28768207245178085])|0         |
|important sentence                |(2000,[537,1092],[0.5753641449035617,0.0])                          |0         |
+----------------------------------+--------------------------------------------------------------------+----------+

Here I have two clusters (0 and 1). I want to select the words with the highest weight in each cluster (at least the top 2 words per cluster).

For example (for convenience, I have listed the weight next to each word):

 prediction
 1          sentence(1092)(0.0)                   simple(1980)(0.5753641449035617)                     
 0          important(537)(0.28768207245178085)   sentence(1092)(0.0)    simple(1980)(0.28768207245178085)
 0          important(537)(0.5753641449035617)    sentence(1092)(0.0) 

So, based on the dataset above, the highest weights among the words of cluster 1 are

"simple" (0.5753641449035617) and "sentence" (0.0)

and the highest weights in cluster 0 are

"important" (0.5753641449035617) and "simple" (0.28768207245178085)

Based on the above, I expect the output to look like the following:

+----------+-----------------------------------------------+-------------------+-----------------------------------------+
|prediction|docname                                        |top_terms          |weight                                   |
+----------+-----------------------------------------------+-------------------+-----------------------------------------+
|1         |[simple sentence]                              |[simple, sentence] |[0.5753641449035617, 0.0]                |
|0         |[simple important sentence, important sentence]|[important, simple]|[0.5753641449035617, 0.28768207245178085]|
+----------+-----------------------------------------------+-------------------+-----------------------------------------+

Please help me figure out how to resolve this.

Thanks

question from:https://stackoverflow.com/questions/65915468/how-to-perform-group-by-and-aggregate-operation-on-spark


1 Reply

0 votes
by (71.8m points)
Waiting for answers
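One possible direction (a sketch, not a verified answer): the per-cluster top-term selection can first be prototyped in plain Python on the question's sample data. The `vocab` mapping from vector index to term is an assumption here; in Spark it would typically come from something like `CountVectorizerModel.vocabulary`. In PySpark the same shape could then be achieved by expanding each SparseVector's (index, value) pairs into rows, then `groupBy("prediction")` with `collect_list` or a window function to keep the top-k weights; the exact calls depend on your Spark version.

```python
from collections import defaultdict

# Sample rows mirroring the question's dataset:
# (document, vocabulary indices, tf-idf weights, cluster prediction)
rows = [
    ("simple sentence", [1092, 1980], [0.0, 0.5753641449035617], 1),
    ("simple important sentence", [537, 1092, 1980],
     [0.28768207245178085, 0.0, 0.28768207245178085], 0),
    ("important sentence", [537, 1092], [0.5753641449035617, 0.0], 0),
]

# Hypothetical index -> term mapping; in Spark this would come from the
# fitted vectorizer's vocabulary, not be hard-coded like this.
vocab = {537: "important", 1092: "sentence", 1980: "simple"}

def top_terms_per_cluster(rows, vocab, k=2):
    """Group (term, weight) pairs by cluster, keep each term's maximum
    weight, and return the k heaviest terms per cluster."""
    best = defaultdict(dict)   # cluster -> {term: max weight seen}
    docs = defaultdict(list)   # cluster -> document names
    for doc, idxs, weights, cluster in rows:
        docs[cluster].append(doc)
        for i, w in zip(idxs, weights):
            term = vocab[i]
            best[cluster][term] = max(best[cluster].get(term, float("-inf")), w)
    out = {}
    for cluster, term_weights in best.items():
        top = sorted(term_weights.items(), key=lambda t: t[1], reverse=True)[:k]
        out[cluster] = {
            "docname": docs[cluster],
            "top_terms": [t for t, _ in top],
            "weight": [w for _, w in top],
        }
    return out

result = top_terms_per_cluster(rows, vocab)
# result[1]["top_terms"] -> ["simple", "sentence"]
# result[0]["top_terms"] -> ["important", "simple"]
```

This reproduces the expected output above: cluster 1 yields ["simple", "sentence"] and cluster 0 yields ["important", "simple"], each with the corresponding maximum weights.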

