Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Descriptive stats from frequency table in pandas

I have a frequency table of test scores:

score    count
-----    -----
  77      1105
  78       940
  79      1222
  80      4339
etc

I want to show basic statistics and a boxplot for the sample which is summarized by the frequency table. (For example, the mean of the above example is 79.16 and the median is 80.)

Is there a way to do this in Pandas? All the examples I have seen assume a table of individual cases.

I suppose I could generate a list of individual scores, like this --

In [2]: s = pd.Series([77] * 1105 + [78] * 940 + [79] * 1222 + [80] * 4339)
In [3]: s.describe()
Out[3]: 
count    7606.000000
mean       79.156324
std         1.118439
min        77.000000
25%        78.000000
50%        80.000000
75%        80.000000
max        80.000000
dtype: float64

-- but I am hoping to avoid that; total frequencies in the real non-toy dataset are well up in the billions.

Any help appreciated.

(I think this is a different question from Using describe() with weighted data, which is about applying weights to individual cases.)



1 Reply


Here's a small function that calculates descriptive statistics for frequency distributions:

# from __future__ import division (for Python 2)
import numpy as np
import pandas as pd

def descriptives_from_agg(values, freqs):
    values = np.array(values)
    freqs = np.array(freqs)
    # sort both arrays by value so that the cumulative counts are ordered
    arg_sorted = np.argsort(values)
    values = values[arg_sorted]
    freqs = freqs[arg_sorted]
    count = freqs.sum()
    fx = values * freqs
    mean = fx.sum() / count
    # population variance as E[x^2] - mean^2
    variance = ((freqs * values**2).sum() / count) - mean**2
    variance = count / (count - 1) * variance  # dof correction for sample variance
    std = np.sqrt(variance)
    minimum = np.min(values)
    maximum = np.max(values)
    cumcount = np.cumsum(freqs)
    # quartiles: the smallest value whose cumulative count reaches the given
    # fraction of the total (no interpolation, so results can differ from
    # describe(), which interpolates linearly)
    Q1 = values[np.searchsorted(cumcount, 0.25*count)]
    Q2 = values[np.searchsorted(cumcount, 0.50*count)]
    Q3 = values[np.searchsorted(cumcount, 0.75*count)]
    idx = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
    result = pd.Series([count, mean, std, minimum, Q1, Q2, Q3, maximum], index=idx)
    return result
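
The quartile lines above take the smallest value whose cumulative count reaches the requested fraction, whereas describe() interpolates linearly between neighbouring observations. The two agree in many cases but not always (for the sample 1, 1, 2, 2 the function above gives a median of 1, while describe() gives 1.5). A minimal sketch of an interpolating variant follows; the name quantile_from_freq is hypothetical and the code is only lightly checked:

```python
import numpy as np

def quantile_from_freq(values, freqs, q):
    """Quantile of the expanded sample, using linear interpolation
    between order statistics (the default rule in pandas and NumPy)."""
    values = np.asarray(values)
    freqs = np.asarray(freqs)
    order = np.argsort(values)
    values, freqs = values[order], freqs[order]
    cumcount = np.cumsum(freqs)
    n = cumcount[-1]
    # Fractional 0-based position of the quantile in the sorted sample.
    pos = (n - 1) * q
    lo, hi = int(np.floor(pos)), int(np.ceil(pos))
    # The k-th (0-based) sorted observation belongs to the first group
    # whose cumulative count exceeds k.
    v_lo = values[np.searchsorted(cumcount, lo, side="right")]
    v_hi = values[np.searchsorted(cumcount, hi, side="right")]
    return v_lo + (v_hi - v_lo) * (pos - lo)
```

Only the cumulative counts are inspected, so the individual cases are never materialized.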

A demo:

np.random.seed(0)

val = np.random.normal(100, 5, 1000).astype(int)

pd.Series(val).describe()
Out: 
count    1000.000000
mean       99.274000
std         4.945845
min        84.000000
25%        96.000000
50%        99.000000
75%       103.000000
max       113.000000
dtype: float64

# the top-level pd.value_counts(val) is deprecated in recent pandas versions
vc = pd.Series(val).value_counts()
descriptives_from_agg(vc.index, vc.values)

Out: 
count    1000.000000
mean       99.274000
std         4.945845
min        84.000000
25%        96.000000
50%        99.000000
75%       103.000000
max       113.000000
dtype: float64
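
As a further cross-check, the count, mean and standard deviation can be computed directly from the four rows shown in the question, using np.average with the counts as weights. This is a sketch on the toy rows only; the column names score and count are taken from the question's table:

```python
import numpy as np
import pandas as pd

# The four rows shown in the question (the real table has more rows).
table = pd.DataFrame({"score": [77, 78, 79, 80],
                      "count": [1105, 940, 1222, 4339]})

# Bracket access is needed because DataFrame.count is a method.
n = table["count"].sum()
mean = np.average(table["score"], weights=table["count"])
# Weighted sum of squared deviations, divided by n - 1 (sample variance).
ss = (table["count"] * (table["score"] - mean) ** 2).sum()
std = np.sqrt(ss / (n - 1))

print(n, round(mean, 6), round(std, 6))
```

The values should agree with the describe() output in the question (count 7606, mean 79.156324, std 1.118439).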

Note that this function does not handle NaNs and has not been thoroughly tested.
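
The question also asks for a boxplot, which the function above does not cover. One possible approach, sketched here under assumptions, is to compute the box statistics from the frequency table and pass them to matplotlib's Axes.bxp, which draws a box from precomputed values. The helper names box_stats_from_freq and plot_box are hypothetical, the quartiles use the same cumulative-count rule as above, and outliers are not listed individually:

```python
import numpy as np

def box_stats_from_freq(values, freqs, whis=1.5):
    """Box-plot statistics for a frequency table, in the dict layout
    used by matplotlib's Axes.bxp."""
    values = np.asarray(values, dtype=float)
    freqs = np.asarray(freqs)
    keep = freqs > 0                      # ignore values that never occur
    values, freqs = values[keep], freqs[keep]
    order = np.argsort(values)
    values, freqs = values[order], freqs[order]
    cumcount = np.cumsum(freqs)
    n = cumcount[-1]

    def quantile(q):
        # Smallest value whose cumulative count reaches q * n.
        return values[np.searchsorted(cumcount, q * n)]

    q1, med, q3 = quantile(0.25), quantile(0.50), quantile(0.75)
    iqr = q3 - q1
    # Whiskers: most extreme observed values within whis * IQR of the box.
    inside = values[(values >= q1 - whis * iqr) & (values <= q3 + whis * iqr)]
    return {"med": med, "q1": q1, "q3": q3,
            "whislo": inside.min(), "whishi": inside.max(), "fliers": []}

def plot_box(stats):
    # Imported here so the statistics can be computed without matplotlib.
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.bxp([stats], showfliers=False)
    return fig
```

For example, plot_box(box_stats_from_freq(vc.index, vc.values)) would draw the box for the demo data.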

