Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - How to write a word boundary inside a character class in Python without losing its meaning? I want to add underscore (_) to the definition of word boundary (\b)

I am aware that the definition of a word boundary is (?<!\w)(?=\w)|(?<=\w)(?!\w), and I wish to (optionally) add the underscore to that definition.

One way is to modify the definition itself, so the new one would be (_)?((?<!\w)(?=\w)|(?<=\w)(?!\w)), but I don't want to use such a long expression.

An easier approach would be to write the word boundary inside a character class, since adding an underscore there is trivial, e.g. [\b_]. The problem is that \b inside a character class, i.e. [\b], means the backspace character, not a word boundary.

Please tell me the solution, i.e. how to put \b inside a character class without losing its original meaning.
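
To make the problem concrete, here is a minimal sketch of how Python's re treats \b inside a character class. The sample strings and variable names are made up for illustration.

```python
import re

# Inside a character class, \b is the backspace character (U+0008),
# not a zero-width word boundary.
backspace_hits = re.findall(r"[\b]", "a\bb")  # the subject string holds a real backspace

# So [\b_] consumes one literal backspace or underscore before "word";
# it cannot match "at a boundary" without consuming a character.
no_match = re.search(r"[\b_]word", "a word")             # space before "word": no match
underscore_match = re.search(r"[\b_]word", "some_word")  # the underscore is consumed

print(backspace_hits)            # ['\x08']
print(no_match)                  # None
print(underscore_match.group())  # _word
```

Note that even in the underscore case the match includes the _ itself, which is usually not what a boundary check should do.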

1 Reply

You may use lookarounds:

(?:\b|(?<=_))word(?=\b|_)
^^^^^^^^^^^^^    ^^^^^^^^

Here, (?:\b|(?<=_)) is a non-capturing group matching either a word boundary or a location preceded by _, and (?=\b|_) is a positive lookahead matching either a word boundary or a _ symbol.

Unfortunately, Python re won't allow (?<=\b|_), because a lookbehind pattern must be of fixed width (otherwise you get a "look-behind requires fixed-width pattern" error).
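
A quick way to see this limitation, as a sketch using only the standard re module (the variable names are illustrative, and the exact error text may differ between Python versions):

```python
import re

# \b is zero-width and _ is one character wide, so the two lookbehind
# branches have different widths and re refuses to compile the pattern.
try:
    re.compile(r"(?<=\b|_)word")
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)

print(lookbehind_error)

# Moving the alternation outside the lookbehind keeps each part valid.
rx_prefix = re.compile(r"(?:\b|(?<=_))word")
print(rx_prefix.findall("some_word"))  # ['word']
```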

A Python demo:

import re
# Match "word" when each side is a word boundary or an underscore
rx = r"(?:\b|(?<=_))word(?=\b|_)"
s = "some_word_here and a word there"
print(re.findall(rx, s))  # ['word', 'word']

An alternative solution is to use custom word boundaries like (?<![^\W_]) / (?![^\W_]):

rx = r"(?<![^\W_])word(?![^\W_])"

Here, [^\W_] matches any word character except _ (that is, a letter or digit). The (?<![^\W_]) negative lookbehind fails the match if such a character immediately precedes the search word, so it requires the start of the string, a non-word character, or _ before it. Likewise, the (?![^\W_]) negative lookahead fails the match if a letter or digit immediately follows, so it requires the end of the string, a non-word character, or _ after the word.
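
To check what these custom boundaries accept, the sketch below compares plain \b with the lookaround version on one sample string. The whole_word helper and the sample text are hypothetical, added only for illustration.

```python
import re

def whole_word(word):
    """Hypothetical helper: compile `word` with underscore-tolerant boundaries."""
    # re.escape keeps any regex metacharacters in `word` literal.
    return re.compile(r"(?<![^\W_])" + re.escape(word) + r"(?![^\W_])")

s = "some_word_here and a word there, but not swordfish or word2"

# Plain \b treats _ as a word character, so "some_word_here" is skipped.
plain_hits = re.findall(r"\bword\b", s)

# The custom boundaries accept _ as a delimiter but still reject
# neighbouring letters ("swordfish") and digits ("word2").
custom_hits = whole_word("word").findall(s)

print(plain_hits)   # ['word']
print(custom_hits)  # ['word', 'word']
```

Because both lookarounds are single-character and fixed-width, this form avoids the lookbehind restriction in Python re.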

