Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

javascript - Finding words between special characters using Unicode regex

I have a working regular expression which matches the words below.

Input:

(T1.Test)
(AT.Test)

Match:

T1.Test
AT.Test

But when I try replacing \w with the Unicode property escape \p{L}, the regex no longer works properly.

Current expression: /(?:\w+\()+|\b(\p{L}+(?:\.\p{L}+)?)\b(?!')/gu

Input:

(T1.Test)
(AT.Test)
(ワーク.Test)

Match:

Test
Test
Test

How do I make my regex work properly now that it has the Unicode flag? My expected output is:

T1.Test
AT.Test
ワーク.Test
Question from: https://stackoverflow.com/questions/66056732/finding-words-between-special-characters-using-unicode-regex

1 Reply

First of all, \p{L} does not match digits, so the T1 in (T1.Test) will not be matched, whereas \w would match it.
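To make the difference concrete, here is a small sketch comparing the two classes on the tokens from the question. The variable names are mine, and adding \d to the class is just one possible way to bring digits back.

```javascript
// \w is ASCII-only: A-Z, a-z, 0-9 and underscore.
const asciiWord = /^\w+$/;
// \p{L} (requires the u flag) matches any Unicode letter, but no digits.
const lettersOnly = /^\p{L}+$/u;
// Putting \d next to \p{L} in a class accepts letters and ASCII digits.
const lettersAndDigits = /^[\p{L}\d]+$/u;

const samples = ["T1", "AT", "ワーク"];
const classResults = samples.map((s) => ({
  text: s,
  w: asciiWord.test(s),
  L: lettersOnly.test(s),
  Ld: lettersAndDigits.test(s),
}));

console.log(classResults);
// T1:    w=true,  L=false, Ld=true
// AT:    w=true,  L=true,  Ld=true
// ワーク: w=false, L=true,  Ld=true
```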

Your regex is divided into two big alternatives, "1 | 2":

  1. (?:\w+\()+ is a non-capturing group that matches anything of the shape anyAmountOfWordCharacters(. If it succeeds, the rest of the regex is skipped entirely; I don't know whether that was intentional. For example, aaa(333.6780) triggers your regex with aaa( as the full match but nothing in the capture group, because this alternative does not capture.

  2. \b(\p{L}+(?:\.\p{L}+)?)\b(?!') requires the match to start at a word boundary. A word boundary exists between two characters (see Regex Tutorial) only if one is a word character and the other is not.

In your case, the position after the opening parenthesis is not a word boundary when the next character is not an ASCII word character (\b uses the same ASCII-only definition as \w), so (クーク.Test) will not work, but 3クーク.Test) will.
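Both points can be checked with a short sketch. It assumes the pattern from the question has \b on both sides of the capturing group, which is my reading of it; the variable names are only for illustration.

```javascript
// Assumed form of the pattern from the question (\b around the capture).
const full = /(?:\w+\()+|\b(\p{L}+(?:\.\p{L}+)?)\b(?!')/gu;

// Alternative 1 wins on "aaa(" and leaves capture group 1 undefined.
const firstAlt = Array.from("aaa(333.6780)".matchAll(full), (m) => [m[0], m[1]]);
console.log(firstAlt); // [["aaa(", undefined]]

// Alternative 2 on its own: \b only exists next to an ASCII word character.
const bounded = /\b(\p{L}+(?:\.\p{L}+)?)\b(?!')/gu;
const boundaryInputs = ["(AT.Test)", "(クーク.Test)", "3クーク.Test)"];
const boundaryResults = boundaryInputs.map((s) => s.match(bounded));

console.log(boundaryResults);
// "(AT.Test)"    -> ["AT.Test"]     boundary between "(" and "A"
// "(クーク.Test)" -> ["Test"]        no boundary between "(" and "ク"
// "3クーク.Test)" -> ["クーク.Test"]  boundary between "3" and "ク"
```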

To fix that, you can use only the second part (provided the first is not needed to check something beyond the inputs shown in the question):

// slightly edited, digits allowed: (3123.123) captures 3123.123 in group 1
input.match(/[\b]*\(([\d\p{L}]+(?:\.[\d\p{L}]+)?)\)[\b]*(?!')/gu)

// slightly edited, must start with a letter: (A1.Test) works, (1A.Test) doesn't
input.match(/[\b]*\((\p{L}[\d\p{L}]*(?:\.[\d\p{L}]+)?)\)[\b]*(?!')/gu)
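One thing to keep in mind: with the g flag, match returns the whole matches, parentheses included. To get only the text inside the parentheses you can read capture group 1, for example through matchAll. The sketch below uses a trimmed form of the digits-allowed pattern (only the parenthesised capture and the lookahead are kept); the input string and variable names are assumptions for illustration.

```javascript
// Trimmed digits-allowed pattern: "(" capture ")" not followed by "'".
const withDigits = /\(([\d\p{L}]+(?:\.[\d\p{L}]+)?)\)(?!')/gu;

const text = "(T1.Test)\n(AT.Test)\n(ワーク.Test)";

// match + g flag: whole matches, still wrapped in parentheses.
const wholeMatches = text.match(withDigits);
console.log(wholeMatches); // ["(T1.Test)", "(AT.Test)", "(ワーク.Test)"]

// matchAll: one match object per hit, so group 1 is available.
const captured = Array.from(text.matchAll(withDigits), (m) => m[1]);
console.log(captured); // ["T1.Test", "AT.Test", "ワーク.Test"]
```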

Also, the last part, (?!'), is optional for the input cases you gave, but I suppose it is useful for other purposes.
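As a rough illustration of what that lookahead does, the sketch below runs the same pattern with and without it on a made-up input where one group is followed by a single quote. The sample string and names are hypothetical.

```javascript
// With the negative lookahead: a closing ")" must not be followed by "'".
const guarded = /\(([\d\p{L}]+(?:\.[\d\p{L}]+)?)\)(?!')/gu;
// Same pattern without the lookahead.
const unguarded = /\(([\d\p{L}]+(?:\.[\d\p{L}]+)?)\)/gu;

const quoted = "(A.B)' and (C.D)";

const guardedHits = Array.from(quoted.matchAll(guarded), (m) => m[1]);
const unguardedHits = Array.from(quoted.matchAll(unguarded), (m) => m[1]);

console.log(guardedHits);   // ["C.D"]
console.log(unguardedHits); // ["A.B", "C.D"]
```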

If you want to keep the regex simple for those inputs, this would also work:

// digits allowed: (3123.123) captures 3123.123 in group 1
input.match(/\(([\p{L}\d]+(?:\.[\p{L}\d]+))\)/gu)

// must start with a letter: (A1.Test) works, (1A.Test) doesn't
input.match(/\((\p{L}[\p{L}\d]*(?:\.[\p{L}\d]+))\)/gu)
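A quick side-by-side check of the two simpler variants, reading group 1 via matchAll. The input line and the pick helper are assumptions of mine. Note that the dotted part is mandatory in these variants, so an input like (Test) without a dot would not match.

```javascript
// Any letter or digit may start the token.
const anyStart = /\(([\p{L}\d]+(?:\.[\p{L}\d]+))\)/gu;
// The token must start with a letter.
const letterStart = /\((\p{L}[\p{L}\d]*(?:\.[\p{L}\d]+))\)/gu;

const line = "(A1.Test) (1A.Test) (3123.123) (ワーク.Test)";

// Hypothetical helper: collect capture group 1 of every match.
const pick = (re) => Array.from(line.matchAll(re), (m) => m[1]);

const anyStartHits = pick(anyStart);
const letterStartHits = pick(letterStart);

console.log(anyStartHits);    // ["A1.Test", "1A.Test", "3123.123", "ワーク.Test"]
console.log(letterStartHits); // ["A1.Test", "ワーク.Test"]
```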
