Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - Searching for all Unicode variations of hyphens in Python

I have been trying to extract certain text from PDFs converted into text files. The PDFs came from various sources, and I don't know how they were generated.

The pattern I was trying to extract is simply two digits, followed by a hyphen, and then another two digits, e.g. 12-34. So I wrote a simple regex, \d\d-\d\d, and expected it to work.

However, when I tested it, I found that it missed some hits. Later I noticed that at least two of the hyphens are actually represented as \u2212 and \xad. So I changed my regex to \d\d[-\u2212\xad]\d\d and it worked.
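To make the problem concrete, here is a minimal sketch of the two patterns described above. The sample string and the variable names are made up for illustration; only the two regexes come from the question.

```python
import re

# Hypothetical sample: an ASCII hyphen, U+2212 MINUS SIGN, and U+00AD SOFT HYPHEN.
text = "a 12-34 b 56\u221278 c 90\xad12"

# The first attempt only knows about the ASCII hyphen-minus.
plain = re.compile(r"\d\d-\d\d")

# The widened class from the question; the leading "-" is a literal inside [...].
widened = re.compile(r"\d\d[-\u2212\xad]\d\d")

# ascii() makes the otherwise look-alike (or invisible) characters visible.
print([ascii(m) for m in plain.findall(text)])
print([ascii(m) for m in widened.findall(text)])
```

The first pattern finds only the ASCII variant, while the second also picks up the two look-alikes.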

My question is: since I am going to process so many PDFs that I cannot know what other variations of the hyphen are out there, is there a regex covering all "hyphens", ideally one that looks better than the [-\u2212\xad] expression?



1 Reply


The solution you ask for in the question title implies a whitelisting approach, which means you need to find the characters that you consider similar to hyphens.

You may refer to the Unicode "Punctuation, Dash" category (Pd), which lists the characters that Unicode classifies as dash punctuation.

You may use the PyPI regex module and its \p{Pd} pattern to match any character in that category.
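A rough sketch of the \p{Pd} approach follows. It assumes the third-party regex package is installed (pip install regex). So that the snippet still runs without it, it falls back to building an equivalent class for the standard re module from unicodedata, which is a swapped-in technique and not part of the answer. The helper name build_dash_pattern is hypothetical.

```python
import re
import sys
import unicodedata

try:
    import regex  # third-party package from PyPI, not the stdlib "re"
except ImportError:
    regex = None


def build_dash_pattern():
    """Return a compiled pattern for: two digits, one Pd character, two digits."""
    if regex is not None:
        # \p{Pd} is only understood by the "regex" module, not by "re".
        return regex.compile(r"\d\d\p{Pd}\d\d")
    # Fallback (assumption, not from the answer): enumerate the Pd category
    # using the Unicode data bundled with this Python build.
    dashes = "".join(
        re.escape(chr(cp))
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Pd"
    )
    return re.compile(r"\d\d[" + dashes + r"]\d\d")


pattern = build_dash_pattern()
sample = "12-34 56\u201378 90\u201012"  # hyphen-minus, en dash, U+2010 hyphen
print([ascii(m) for m in pattern.findall(sample)])
```

Which characters count as Pd depends on the Unicode version shipped with the regex package or with Python, so the exact set may differ slightly between environments.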

Or, if you can only work with re, use

[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]

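You may expand this list with other Unicode characters that have "minus" in their Unicode names. Note that the two characters from the question are not part of the Pd category: U+2212 MINUS SIGN belongs to Sm (Math Symbol) and U+00AD SOFT HYPHEN belongs to Cf (Format), so they have to be added to the class explicitly.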
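As an illustration of the re-only variant, the sketch below combines the dash class given above with the two extra characters from the question. The constant names PD_CLASS and EXTRA and the sample string are hypothetical; which extra characters to add is a judgment call for your data.

```python
import re

# The explicit dash class from the answer, split across two raw strings.
# The "re" module resolves the \uXXXX escapes itself.
PD_CLASS = (
    r"\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B"
    r"\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D"
)

# Extra look-alikes from the question: MINUS SIGN and SOFT HYPHEN.
EXTRA = r"\u2212\u00AD"

pattern = re.compile(r"\d\d[" + PD_CLASS + EXTRA + r"]\d\d")

sample = "12-34 56\u201478 90\u221212 11\xad22 33\uFF0D44 55 66"
print([ascii(m) for m in pattern.findall(sample)])
```

Keeping the class in a named constant makes it easier to extend later than an inline literal.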

A blacklisting approach means you exclude specific characters between the two pairs of digits and accept anything else. If you want to match any non-whitespace character, you may use \S. If you want to match any punctuation or symbol, use (?:[^\w\s]|_).
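The difference between the two blacklisting patterns can be sketched as follows. The sample strings and helper names are invented for illustration; the two separator expressions are the ones named above.

```python
import re

# Hypothetical candidates: hyphen, minus sign, underscore, space, letter, digit.
samples = ["12-34", "12\u221234", "12_34", "12 34", "12a34", "12345"]

# Any non-whitespace separator: also accepts letters and digits.
any_non_space = re.compile(r"\d\d\S\d\d")

# Anything that is neither a word character nor whitespace, plus underscore
# (which \w would otherwise count as a word character).
punct_or_symbol = re.compile(r"\d\d(?:[^\w\s]|_)\d\d")


def matching(pattern, items):
    """Return the items that the pattern matches in full."""
    return [s for s in items if pattern.fullmatch(s)]


loose = matching(any_non_space, samples)
strict = matching(punct_or_symbol, samples)
print([ascii(s) for s in loose])
print([ascii(s) for s in strict])
```

The \S variant is the most permissive and will also accept strings such as 12345, so the punctuation-or-symbol variant is likely the safer default when the separator is expected to look like a dash.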

