Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
770 views
in Technique[技术] by (71.8m points)

regex - searching matching string pattern from dataframe column in python pandas

i have a data-frame like below

 name         genre
 satya      |ACTION|DRAMA|IC|
 satya      |COMEDY|BIOPIC|SOCIAL|
 abc        |CLASSICAL|
 xyz        |ROMANCE|ACTION|DARMA|
 def        |DISCOVERY|SPORT|COMEDY|IC|
 ghj        |IC|

Now I want to query the dataframe so that i can get row 1,5 and 6.i:e i want to find |IC| with alone or with any combination of other genres.

Upto now i am able to do either a exact search using

df[df['genre'] == '|ACTION|DRAMA|IC|']  ######exact value yields row 1

or a string contains search by

 df[df['genre'].str.contains('IC')]  ####yields row 1,2,3,5,6
 # as BIOPIC has IC in that same for CLASSICAL also

But i don't want these two.

#df[df['genre'].str.contains('|IC|')]  #### row 6
# This also not satisfying my need as i am missing rows 1 and 5

So my requirement is to find genres having |IC| in them.(My string search fails because python treats '|' as or operator)

Somebody suggest some reg or any method to do that.Thanks in ADv.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I think you can add to regex for escaping , because | without is interpreted as OR:

'|'

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use |, or enclose it inside a character class, as in [|].

print df['genre'].str.contains(u'|IC|')
0     True
1    False
2    False
3    False
4     True
5     True
Name: genre, dtype: bool

print df[df['genre'].str.contains(u'|IC|')]
    name                        genre
0  satya            |ACTION|DRAMA|IC|
4    def  |DISCOVERY|SPORT|COMEDY|IC|
5    ghj                         |IC|

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...