Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
735 views
in Technique[技术] by (71.8m points)

regex - Python re infinite execution

I'm trying to execute this code :

import re
pattern = r"(w+)*([ws]+)*/$"
re_compiled = re.compile(pattern)
results = re_compiled.search('COPRO*HORIZON 2000                 HOR')
print(results.groups())

But Python does not respond. The process takes 100% of the CPU and does not stop. I've tried this both on Python 2.7.1 and Python 3.2 with identical results.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Your regex runs into catastrophic backtracking because you have nested quantifiers (([...]+)*). Since your regex requires the string to end in / (which fails on your example), the regex engine tries all permutations of the string in the vain hope to find a matching combination. That's where it gets stuck.

To illustrate, let's assume "A*BCD" as the input to your regex and see what happens:

  1. (w+) matches A. Good.
  2. * matches *. Yay.
  3. [ws]+ matches BCD. OK.
  4. / fails to match (no characters left to match). OK, let's back up one character.
  5. / fails to match D. Hum. Let's back up some more.
  6. [ws]+ matches BC, and the repeated [ws]+ matches D.
  7. / fails to match. Back up.
  8. / fails to match D. Back up some more.
  9. [ws]+ matches B, and the repeated [ws]+ matches CD.
  10. / fails to match. Back up again.
  11. / fails to match D. Back up some more, again.
  12. How about [ws]+ matches B, repeated [ws]+ matches C, repeated [ws]+ matches D? No? Let's try something else.
  13. [ws]+ matches BC. Let's stop here and see what happens.
  14. Darn, / still doesn't match D.
  15. [ws]+ matches B.
  16. Still no luck. / doesn't match C.
  17. Hey, the whole group is optional (...)*.
  18. Nope, / still doesn't match B.
  19. OK, I give up.

Now that was a string of just three letters. Yours had about 30, trying all permutations of which would keep your computer busy until the end of days.

I suppose what you're trying to do is to get the strings before/after *, in which case, use

pattern = r"(w+)*([ws]+)$"

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...