Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


0 votes
238 views
in Technique by (71.8m points)

apache spark - Keep last when using dropduplicates?

I want to keep the last record, not the first. However, a `keep="last"` option (as in pandas) does not seem to exist. For example, on the following:

from pyspark.sql import Row
df = sc.parallelize([ 
    Row(name='Alice', age=5, height=80), 
    Row(name='Alice', age=5, height=80), 
    Row(name='Alice', age=10, height=80)]).toDF()
df.dropDuplicates().show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
| 10|    80|Alice|
+---+------+-----+

And I run:

df.dropDuplicates(['name', 'height']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
+---+------+-----+

I would like the following:

+---+------+-----+
|age|height| name|
+---+------+-----+
| 10|    80|Alice|
+---+------+-----+

`keep="last"` does not appear to be an option for `dropDuplicates` in PySpark?

question from:https://stackoverflow.com/questions/66049380/keep-last-when-using-dropduplicates


1 Reply

0 votes
by (71.8m points)

The common way to do this sort of task is to compute a rank over a suitable partitioning and ordering, and keep the rows with rank = 1:

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'rank',
    F.rank().over(Window.partitionBy('name', 'height').orderBy(F.desc('age')))
).filter('rank = 1').drop('rank')

df2.show()
+-----+---+------+
| name|age|height|
+-----+---+------+
|Alice| 10|    80|
+-----+---+------+

Another way is to use `last`, but without an explicit ordering its result is non-deterministic:

import pyspark.sql.functions as F

df2 = df.groupBy('name', 'height').agg(
    *[F.last(c).alias(c) for c in df.columns if c not in ['name', 'height']]
)

df2.show()
+-----+------+---+
| name|height|age|
+-----+------+---+
|Alice|    80| 10|
+-----+------+---+
