Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Get consecutive capitalized words using regex

I am having trouble with my regex for capturing consecutive capitalized words. Here is what I want the regex to capture:

"said Polly Pocket and the toys" -> Polly Pocket

Here is the regex I am using:

re.findall(r'said ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)', article)

It returns the following:

[('Polly Pocket', ' Pocket')]

I want it to return:

['Polly Pocket']

1 Reply


Use a positive look-ahead:

([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)

The look-ahead asserts that a word is accepted only if it is followed by whitespace and another word that starts with a capital letter. Broken down:

(                  # begin capture
  [A-Z]            # one uppercase letter    \ First Word
  [a-z]+           # 1+ lowercase letters    /
  (?=\s[A-Z])      # must be followed by a whitespace character and an uppercase letter
  (?:              # non-capturing group
    \s             # one whitespace character
    [A-Z]          # one uppercase letter    \ Additional Word(s)
    [a-z]+         # 1+ lowercase letters    /
  )+               # group can be repeated (more words)
)                  # end capture
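
To check this against the example from the question, here is a minimal sketch. The `article` string is the sample sentence from the question; the names `pattern`, `matches`, and `original_fixed` are only illustrative. It also shows what is probably the smallest change to the original regex: re.findall returns tuples whenever a pattern contains more than one capturing group, so turning the inner group into a non-capturing one, (?:...), should be enough to get plain strings back.

```python
import re

# Sample sentence taken from the question.
article = "said Polly Pocket and the toys"

# Pattern from the answer: only the outer group captures,
# so re.findall returns a list of strings rather than tuples.
pattern = re.compile(r"([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)")
matches = pattern.findall(article)
print(matches)  # ['Polly Pocket']

# Alternative: keep the question's regex, but make the inner
# repeated group non-capturing with (?:...).
original_fixed = re.compile(r"said ([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)")
print(original_fixed.findall(article))  # ['Polly Pocket']
```

Note that the two patterns are not equivalent: the answer's version accepts only words of the form "one capital followed by lowercase letters", while the question's [\w-]* also allows digits, underscores, and hyphens inside a word.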
