python - How to get a total count per distinct combination of two columns with PySpark?

Question

posted Oct 6, 2021 in Technique by 深蓝 (71.8m points)

How do I sum the Frequency column for each distinct combination of ID and Location in PySpark?

It feels like I need a window partitioned by ID and Location and then add up the frequency, but I'm not sure how to write this in PySpark code:

Input


1 Reply

深蓝 · Answer 1 · 2021-10-06T03:09:22+0000

Just a simple group by and sum:

import pyspark.sql.functions as F

# Group rows by each distinct (ID, Location) pair and sum Frequency within each group
df2 = df.groupBy('ID', 'Location').agg(F.sum('Frequency').alias('TotalFrequency'))