Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
295 views
in Technique by (71.8m points)

python - Pandas fillna using groupby and mode

I recently started working with Pandas and I'm currently trying to impute some missing values in my dataset.

I want to impute the missing values based on the median (for numerical entries) and mode (for categorical entries).

However, I do not want to calculate the median and mode over the whole dataset, but per group, using a GroupBy on my column called "make".

For numerical values I have done the following:

data = data.fillna(data.groupby("make").transform("median"))

This works perfectly and replaces all my numerical NA values with the median of their "make" group.
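
For readers who want to try the numeric step in isolation, here is a minimal sketch. The frame, its values, and every column name other than "make" are invented for illustration and are not from the question. It restricts the median to numeric columns, which avoids asking pandas for the median of a string column (recent pandas versions may raise an error there).

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the "make" column name comes from the question.
data = pd.DataFrame({
    "make": ["audi", "audi", "audi", "bmw", "bmw", "bmw"],
    "price": [10.0, np.nan, 30.0, 5.0, 7.0, np.nan],
    "doors": [4.0, 2.0, np.nan, np.nan, 4.0, 4.0],
    "fuel": ["gas", None, "gas", "diesel", "diesel", None],
})

# Restrict to numeric columns so the median is never attempted on strings.
numeric_cols = data.select_dtypes(include="number").columns.tolist()
group_medians = data.groupby("make")[numeric_cols].transform("median")

# fillna aligns on index and column labels, so only numeric NaNs are filled.
data[numeric_cols] = data[numeric_cols].fillna(group_medians)
print(data)
```

The categorical column ("fuel" in this sketch) is left untouched, which is exactly the gap the question is about.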

However, I couldn't manage to do the same thing for the mode, i.e. replace all categorical NA values with the mode of their "make" group.

Does anyone know how to do it?

Asked by mt1212 (translated from Stack Overflow)

1 Reply

0 votes
by (71.8m points)

You can use GroupBy.transform with an if-else expression that returns the median for numeric columns and the mode for categorical columns:

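import numpy as np
import pandas as pd

# Sample frame: A and F are categorical, B/C/D are numeric, G is the grouping key.
df = pd.DataFrame({
    'A': list('ebcded'),
    'B': [np.nan, np.nan, 4, 5, 5, 4],
    'C': [7, np.nan, 9, 4, 2, 3],
    'D': [1, 3, 5, np.nan, 1, 0],
    'F': list('aaabbb'),
    'G': list('aaabbb'),
})

# Introduce missing values into the categorical columns.
df.loc[[2, 4], 'A'] = np.nan
df.loc[[2, 5], 'F'] = np.nan
print(df)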
     A    B    C    D    F  G
0    e  NaN  7.0  1.0    a  a
1    b  NaN  NaN  3.0    a  a
2  NaN  4.0  9.0  5.0  NaN  a
3    d  5.0  4.0  NaN    b  b
4  NaN  5.0  2.0  1.0    b  b
5    d  4.0  3.0  0.0  NaN  b

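# Median for numeric columns, otherwise the first mode (ties resolve to the smallest value).
# is_numeric_dtype also accepts pandas extension dtypes such as category.
# Note: x.mode().iloc[0] raises IndexError if a group has no non-missing values.
f = lambda x: x.median() if pd.api.types.is_numeric_dtype(x.dtype) else x.mode().iloc[0]
df = df.fillna(df.groupby('G').transform(f))
print(df)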

   A    B    C    D  F  G
0  e  4.0  7.0  1.0  a  a
1  b  4.0  8.0  3.0  a  a
2  b  4.0  9.0  5.0  a  a
3  d  5.0  4.0  0.5  b  b
4  d  5.0  2.0  1.0  b  b
5  d  4.0  3.0  0.0  b  b
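
One edge case the approach above does not cover is a group in which a categorical column has no observed values at all: mode() then returns an empty Series and .iloc[0] fails. The following is a minimal sketch of a more defensive variant, not part of the original answer. The helper name fill_by_group and the demo frame are hypothetical, and the choice to leave such groups as NaN is an assumption you may want to change.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def fill_by_group(df, key):
    """Fill NaNs per group: median for numeric columns, mode otherwise."""
    out = df.copy()
    for col in out.columns.drop(key):
        if is_numeric_dtype(out[col]):
            fill = out.groupby(key)[col].transform("median")
        else:
            # mode() is empty when a group has no observed values;
            # fall back to NaN instead of raising IndexError.
            fill = out.groupby(key)[col].transform(
                lambda s: s.mode().iloc[0] if s.notna().any() else np.nan
            )
        out[col] = out[col].fillna(fill)
    return out


# Hypothetical demo data: group "b" has no observed value in "cat".
demo = pd.DataFrame({
    "G": list("aaabbb"),
    "num": [1.0, np.nan, 3.0, np.nan, 4.0, 6.0],
    "cat": ["x", "x", None, None, None, None],
})
print(fill_by_group(demo, "G"))
```

Working column by column keeps the numeric and categorical branches separate and returns a new frame, so the input is not modified. Rows in a group with nothing to learn the mode from simply stay missing.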
