The common way to do this sort of task is to compute a rank with a suitable partitioning and ordering, and keep only the rows with rank = 1.
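Both snippets below assume a df with name, age, and height columns. The exact data is not shown in the question, so here is a minimal sketch of an input DataFrame (the values are an assumption, chosen to reproduce the output shown):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data: two rows for Alice at the same height, different ages
df = spark.createDataFrame(
    [('Alice', 5, 80), ('Alice', 10, 80)],
    ['name', 'age', 'height'],
)

With that in place, the rank-based approach is: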
from pyspark.sql import functions as F, Window

# rank rows within each (name, height) group by descending age,
# then keep only the top-ranked row of each group
df2 = df.withColumn(
    'rank',
    F.rank().over(Window.partitionBy('name', 'height').orderBy(F.desc('age')))
).filter('rank = 1').drop('rank')
df2.show()
+-----+---+------+
| name|age|height|
+-----+---+------+
|Alice| 10| 80|
+-----+---+------+
Another way is to use last, but it gives nondeterministic results:
import pyspark.sql.functions as F

# take the "last" value of every non-key column per (name, height) group;
# which row counts as last depends on partition order, so this is nondeterministic
df2 = df.groupBy('name', 'height').agg(
    *[F.last(c).alias(c) for c in df.columns if c not in ['name', 'height']]
)
df2.show()
+-----+------+---+
| name|height|age|
+-----+------+---+
|Alice| 80| 10|
+-----+------+---+
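If determinism matters, one option is to aggregate with max_by instead of last, so each non-key column is taken from the row with the greatest age. This is only a sketch and assumes Spark 3.3+, where F.max_by is available:

import pyspark.sql.functions as F

# for every non-key column, take its value from the row with the largest age
df2 = df.groupBy('name', 'height').agg(
    *[F.max_by(c, 'age').alias(c) for c in df.columns if c not in ['name', 'height']]
)
df2.show()

This matches the rank-based result above, although ties on age are still resolved arbitrarily.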