The common way to do this sort of task is to compute a rank with a suitable partitioning and ordering, and keep only the rows with rank = 1.
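Both snippets below assume a df with name, age, and height columns. The exact data is not shown in the question, so here is a minimal sketch of an input DataFrame (the values are an assumption, chosen to reproduce the output shown):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data: two rows for Alice at the same height, different ages
df = spark.createDataFrame(
    [('Alice', 5, 80), ('Alice', 10, 80)],
    ['name', 'age', 'height'],
)

With that in place, the rank-based approach is: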
from pyspark.sql import functions as F, Window

# rank rows within each (name, height) group by descending age,
# then keep only the top-ranked row of each group
df2 = df.withColumn(
    'rank',
    F.rank().over(Window.partitionBy('name', 'height').orderBy(F.desc('age')))
).filter('rank = 1').drop('rank')
df2.show()
+-----+---+------+
| name|age|height|
+-----+---+------+
|Alice| 10| 80|
+-----+---+------+
Another way is to use last, but it gives nondeterministic results:
import pyspark.sql.functions as F

# take the "last" value of every non-key column per (name, height) group;
# which row counts as last depends on partition order, so this is nondeterministic
df2 = df.groupBy('name', 'height').agg(
    *[F.last(c).alias(c) for c in df.columns if c not in ['name', 'height']]
)
df2.show()
+-----+------+---+
| name|height|age|
+-----+------+---+
|Alice| 80| 10|
+-----+------+---+
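If determinism matters, one option is to aggregate with max_by instead of last, so each non-key column is taken from the row with the greatest age. This is only a sketch and assumes Spark 3.3+, where F.max_by is available:

import pyspark.sql.functions as F

# for every non-key column, take its value from the row with the largest age
df2 = df.groupBy('name', 'height').agg(
    *[F.max_by(c, 'age').alias(c) for c in df.columns if c not in ['name', 'height']]
)
df2.show()

This matches the rank-based result above, although ties on age are still resolved arbitrarily.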