Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
359 views
in Technique[技术] by (71.8m points)

python - Spark dataframe not adding columns with null values

I am trying to create a new column by adding two existing columns in my dataframe.

Original dataframe

╔══════╦══════╗
║ cola ║ colb ║
╠══════╬══════╣
║ 1    ║ 1    ║
║ null ║ 3    ║
║ 2    ║ null ║
║ 4    ║ 2    ║
╚══════╩══════╝

Expected output with derived column

╔══════╦══════╦══════╗
║ cola ║ colb ║ colc ║
╠══════╬══════╬══════╣
║ 1    ║ 1    ║    2 ║
║ null ║ 3    ║    3 ║
║ 2    ║ null ║    2 ║
║ 4    ║ 2    ║    6 ║
╚══════╩══════╩══════╝

When I use df = df.withColumn('colc',df.cola+df.colb), it doesn't add columns with null values.

The output I get is:

╔══════╦══════╦══════╗
║ cola ║ colb ║ colc ║
╠══════╬══════╬══════╣
║ 1    ║ 1    ║ 2    ║
║ null ║ 3    ║ null ║
║ 2    ║ null ║ null ║
║ 4    ║ 2    ║ 6    ║
╚══════╩══════╩══════╝

Is there any way to incorporate the null values into the calculation. Any help would be appreciated.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You can coalesce to 0 to get a sum. For cases where both columns are null, you can make use of conditional functions.

For your case, the code should look something like

df.selectExpr('*', 'if(isnull(cola) and isnull(colb), null, coalesce(cola, 0) + coalesce(colb, 0)) as colc')

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...