Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

pandas - Retain a few NAs and drop the rest during a stack operation in Python

I have a dataframe like the one shown below:

df2 = pd.DataFrame({'person_id':[1],'H1_date' : ['2006-10-30 00:00:00'], 'H1':[2.3],'H2_date' : ['2016-10-30 00:00:00'], 'H2':[12.3],'H3_date' : ['2026-11-30 00:00:00'], 'H3':[22.3],'H4_date' : ['2106-10-30 00:00:00'], 'H4':[42.3],'H5_date' : [np.nan], 'H5':[np.nan],'H6_date' : ['2006-10-30 00:00:00'], 'H6':[2.3],'H7_date' : [np.nan], 'H7':[2.3],'H8_date' : ['2006-10-30 00:00:00'], 'H8':[np.nan]})

My source dataframe (df2) contains a few NAs.

When I do df2.stack(), I lose all the NA's from the data.

However, I would like to retain the NA for H7_date and for H8, because each has a valid counterpart in its value/date pair: H7_date has a valid H7 value, and H8 has a corresponding H8_date.

I would like to drop records only when both values of a pair (e.g. H5_date and H5) are NA.

Please note that I show only a few columns here; my real data has more than 150 columns, and the column names aren't known in advance.

I expect the output to keep every row except H5_date and H5, which should be dropped because both are NA.


1 Reply


One approach is to melt the DF, assign a key that identifies columns in the same "group" (in this case H<some digits>, but you can amend that as required), then group by person and that key, and keep only the groups that contain at least one non-NA value, e.g.:

Starting with:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'person_id': [1],
    'H1_date': ['2006-10-30 00:00:00'], 'H1': [2.3],
    'H2_date': ['2016-10-30 00:00:00'], 'H2': [12.3],
    'H3_date': ['2026-11-30 00:00:00'], 'H3': [22.3],
    'H4_date': ['2106-10-30 00:00:00'], 'H4': [42.3],
    'H5_date': [np.nan], 'H5': [np.nan],
    'H6_date': ['2006-10-30 00:00:00'], 'H6': [2.3],
    'H7_date': [np.nan], 'H7': [2.3],
    'H8_date': ['2006-10-30 00:00:00'], 'H8': [np.nan],
})

Use:

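df2 = (
    df.melt(id_vars='person_id')
    # Group key: the digits after "H", shared by e.g. H7 and H7_date
    .assign(_gid=lambda v: v['variable'].str.extract(r'H(\d+)', expand=False))
    .groupby(['person_id', '_gid'])
    # Keep a group if at least one of its values is not NA
    .filter(lambda g: g['value'].notna().any())
    # Positional axis arguments to drop() were removed in pandas 2.0
    .drop(columns='_gid')
)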

Which gives you:

    person_id variable                value
0           1  H1_date  2006-10-30 00:00:00
1           1       H1                  2.3
2           1  H2_date  2016-10-30 00:00:00
3           1       H2                 12.3
4           1  H3_date  2026-11-30 00:00:00
5           1       H3                 22.3
6           1  H4_date  2106-10-30 00:00:00
7           1       H4                 42.3
10          1  H6_date  2006-10-30 00:00:00
11          1       H6                  2.3
12          1  H7_date                  NaN
13          1       H7                  2.3
14          1  H8_date  2006-10-30 00:00:00
15          1       H8                  NaN

You can then use that as a starting point to tweak if necessary.
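
One possible tweak, since the real column names aren't known in advance: instead of matching H<digits>, derive the group key by stripping the date suffix from the column name. The sketch below assumes every value column X is paired with a column named X_date; the helper name drop_all_na_groups and its parameters are made up for illustration, and Series.str.removesuffix needs pandas 1.4 or later. It tests for missing values with notna(), so a legitimate value of 0 still counts as present.

```python
import numpy as np
import pandas as pd


def drop_all_na_groups(df, id_col="person_id", suffix="_date"):
    """Melt df and drop each value/date pair in which both entries are NA.

    Assumption (not from the original post): every value column X has a
    companion column named X + suffix.
    """
    long = df.melt(id_vars=id_col)
    # 'H7_date' and 'H7' both map to the key 'H7'
    key = long["variable"].str.removesuffix(suffix)
    # True for every row whose (person, key) group has at least one non-NA value
    keep = long["value"].notna().groupby([long[id_col], key]).transform("any")
    return long[keep]


demo = pd.DataFrame({
    "person_id": [1],
    "A_date": ["2006-10-30"], "A": [0.0],   # zero is a real value, so kept
    "B_date": [np.nan], "B": [np.nan],      # both NA, so dropped
    "C_date": [np.nan], "C": [2.3],         # only one NA, so kept
})
result = drop_all_na_groups(demo)
print(result["variable"].tolist())  # ['A_date', 'A', 'C_date', 'C']
```

If the pairing convention in your data is different (a prefix, or a different suffix), only the line that builds key needs to change.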

