Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
579 views
in Technique[技术] by (71.8m points)

python - How to select DataFrame columns based on partial matching?

I was struggling this afternoon to find a way of selecting few columns of my Pandas DataFrame, by checking the occurrence of a certain pattern in their name (label?).

I had been looking for something like contains or isin for nd.arrays / pd.series, but got no luck.

This frustrated me quite a bit, as I was already checking the columns of my DataFrame for occurrences of specific string patterns, as in:

hp = ~(df.target_column.str.contains('some_text') | df.target_column.str.contains('other_text'))
df_cln= df[hp]

However, no matter how I banged my head, I could not apply .str.contains() to the object returned bydf.columns - which is an Index - nor the one returned by df.columns.values - which is an ndarray. This works fine for what is returned by the "slicing" operation df[column_name], i.e. a Series, though.

My first solution involved a for loop and the creation of a help list:

ll = []
for a in df.columns:
    if a.startswith('start_exp1') | a.startswith('start_exp2'):
    ll.append(a)
df[ll]

(one could apply any of the str functions, of course)

Then, I found the map function and got it to work with the following code:

import re
sel = df.columns.map(lambda x: bool(re.search('your_regex',x))
df[df.columns[sel]]

Of course in the first solution I could have performed the same kind of regex checking, because I can apply it to the str data type returned by the iteration.

I am very new to Python and never really programmed anything so I am not too familiar with speed/timing/efficiency, but I tend to think that the second method - using a map - could potentially be faster, besides looking more elegant to my untrained eye.

I am curious to know what you think of it, and what possible alternatives would be. Given my level of noobness, I would really appreciate if you could correct any mistakes I could have made in the code and point me in the right direction.

Thanks, Michele

EDIT : I just found the Index method Index.to_series(), which returns - ehm - a Series to which I could apply .str.contains('whatever'). However, this is not quite as powerful as a true regex, and I could not find a way of passing the result of Index.to_series().str to the re.search() function..

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Select column by partial string, can simply be done, via:

df.filter(like='hello')  # select columns which contain the word hello

And to select rows by partial string match, you can pass axis=0 to filter:

df.filter(like='hello', axis=0) 

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...