python - Pyspark replace NaN with NULL

Question


posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I use Spark to perform data transformations that I load into Redshift. Redshift does not support NaN values, so I need to replace all occurrences of NaN with NULL.

I tried something like this:

some_table = spark.sql('SELECT * FROM some_table')
some_table = some_table.na.fill(None)  # raises ValueError: None is not an accepted fill value

But I got the following error:

ValueError: value should be a float, int, long, string, bool or dict

So it seems like na.fill() doesn't support None. I specifically need to replace with NULL, not some other value, like 0.



1 Reply

深蓝 · Answer 1 · 2021-10-23T19:36:19+0000

df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()

+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|1.0|
+----+---+

df = df.replace(float('nan'), None)
df.show()

+----+----+
|   a|   b|
+----+----+
|   1|null|
|null| 1.0|
+----+----+

You can use DataFrame.replace to convert NaN values to null in a single line, as shown above. Unlike na.fill(), replace() accepts None as the replacement value.
