python - Making histogram with Spark DataFrame column

Question


Posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)


I am trying to make a histogram from one column of a DataFrame that looks like this:

DataFrame[C0: int, C1: int, ...]

How should I make a histogram of the column C1?

So far I have tried:

df.groupBy("C1").count().histogram()
df.C1.countByValue()

Neither works, because of a mismatch in data types.


1 Reply

深蓝 · Answer 1 · 2021-10-23T19:09:39+0000

The pyspark_dist_explore package that @Chris van den Berg mentioned is quite nice. If you prefer not to add another dependency, you can use this bit of code to plot a simple histogram.

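import matplotlib.pyplot as plt

# Compute the histogram of the 'C1' column in Spark; only the bucket
# boundaries and the per-bucket counts are returned to the driver.
bins, counts = df.select('C1').rdd.flatMap(lambda x: x).histogram(20)

# This is a bit awkward, but I believe it is the correct way to do it:
# pass the left edge of each bucket as the data and weight it by its count.
plt.hist(bins[:-1], bins=bins, weights=counts)
plt.show()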
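
To see why passing bins[:-1] with weights=counts redraws the precomputed histogram, here is a minimal pure-Python sketch that runs without Spark or matplotlib. Both functions are hypothetical helpers written for this illustration, not part of PySpark: even_histogram mimics the documented bucketing of RDD.histogram with an integer argument (equal-width buckets between the minimum and maximum, each half-open except the last, which also includes the maximum), and replay_weighted does what plt.hist is being asked to do with the left edges and weights. The sample values are made up.

```python
def even_histogram(values, n_buckets):
    """Return (edges, counts) for n_buckets equal-width buckets over [min, max].

    Hypothetical local stand-in for RDD.histogram(n_buckets).
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values identical: a single bucket avoids dividing by zero below.
        return [lo, hi], [len(values)]
    width = (hi - lo) / n_buckets
    edges = [lo + i * width for i in range(n_buckets)] + [hi]
    counts = [0] * n_buckets
    for v in values:
        # Buckets are [left, right) except the last, which also holds the maximum.
        idx = min(int((v - lo) / width), n_buckets - 1)
        counts[idx] += 1
    return edges, counts


def replay_weighted(edges, counts):
    """Re-bin each bucket's left edge, weighted by that bucket's count."""
    n = len(counts)
    rebinned = [0] * n
    for left, weight in zip(edges[:-1], counts):
        for i in range(n):
            is_last = i == n - 1
            if edges[i] <= left < edges[i + 1] or (is_last and left == edges[i + 1]):
                rebinned[i] += weight
                break
    return rebinned


# Made-up sample data standing in for the values of column C1.
edges, counts = even_histogram([1, 2, 2, 3, 5, 8, 9, 10], 3)
print(edges, counts)
print(replay_weighted(edges, counts))
```

Each left edge falls into its own bucket, so the weight attached to it becomes that bucket's bar height and the re-binned counts match the originals. That is the reason the plotting call needs only the small bins and counts lists rather than the full column.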
