Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


javascript - How can I make a regular expression which takes accented characters into account?

I have a JavaScript regular expression which basically finds two-letter words. The problem seems to be that it interprets accented characters as word boundaries. Indeed, it seems that

A word boundary ("\b") is a spot between two characters that has a "\w" on one side of it and a "\W" on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a "\W". (Source: "AS3 RegExp to match words with boundry type characters in them")

And since

\w matches any alphanumerical character (word characters) including underscore (short for [a-zA-Z0-9_]). \W matches any non-word characters (short for [^a-zA-Z0-9_]) http://www.javascriptkit.com/javatutors/redev2.shtml

obviously accented characters are not taken into account. This becomes a problem with words like Montréal. If the é is treated as a non-word character, there is a word boundary on either side of it, and al is then matched as a two-letter word. I have tried writing my own definition of a word boundary that allows for accented characters, but since a word boundary isn't a character at all (it is a position between characters), I don't know how to go about matching it.

Any help?

Here is the relevant JavaScript code, which searches userInput and finds two-letter words using the re_state regular expression:

var re_state = new RegExp("\\b([a-z]{2})[,]?\\b", "mi");
var match_state = re_state.exec(userInput);
document.getElementById("state").value = (match_state)?match_state[1]:"";
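
For reference, a minimal reproduction of the behaviour described above. It uses the same pattern written as a regex literal; the function name findTwoLetterWord is hypothetical and only there to keep the example free of the DOM.

```javascript
// Same pattern as re_state, as a regex literal (no string escaping needed).
// findTwoLetterWord is a hypothetical helper, not part of the original code.
function findTwoLetterWord(input) {
  var match = /\b([a-z]{2})[,]?\b/mi.exec(input);
  return match ? match[1] : "";
}

// "é" is not in [a-zA-Z0-9_], so \b sees a boundary on both sides of it
// and the trailing "al" is reported as a two-letter word.
console.log(findTwoLetterWord("Montréal"));   // "al"

// With plain ASCII input the pattern behaves as intended.
console.log(findTwoLetterWord("Toronto ON")); // "ON"
```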

1 Reply


While JavaScript regexes recognize non-ASCII characters in some cases (like \s), they are hopelessly inadequate when it comes to \w and \b. If you want those to work with anything beyond the ASCII word characters, you'll have to either use a different language, or install Steve Levithan's XRegExp library with the Unicode plugin.
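
Another possibility, if you can assume an engine that implements ES2018, is to build the boundary by hand from Unicode property escapes and lookarounds instead of relying on \b. The following is only a sketch under that assumption, not the XRegExp approach mentioned above: it counts only letters (\p{L}) as word characters, so digits and underscores are treated differently than with \w, and the function name is hypothetical.

```javascript
// Sketch: a two-letter run that is neither preceded nor followed by a letter.
// Needs the "u" flag for \p{L} and lookbehind support (both ES2018).
var twoLetterWord = /(?<!\p{L})\p{L}{2}(?!\p{L})/u;

// Hypothetical helper. Normalizing to NFC first means "é" is a single
// precomposed letter rather than "e" plus a combining accent, which
// \p{L} would not treat as part of the word.
function findTwoLetterWordUnicode(input) {
  var match = twoLetterWord.exec(input.normalize("NFC"));
  return match ? match[0] : "";
}

console.log(findTwoLetterWordUnicode("Montréal"));     // ""
console.log(findTwoLetterWordUnicode("Montréal, QC")); // "QC"
```

Because the lookarounds only assert what surrounds the match, they consume no characters, which is the same role \b plays in the original pattern.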

By the way, there's an error in your regex. You have a \b after the optional trailing comma, but it should be in front:

"\\b([a-z]{2})\\b,?"

I also removed the square brackets; you would only need those if the comma had a special meaning in regexes, which it doesn't. But I suspect you don't need to match the comma at all; \b should be sufficient to make sure you're at the end of the word. And if you don't need the comma, you don't need the capturing group either:

"\\b[a-z]{2}\\b"
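
A short usage sketch of that last pattern, assuming ASCII input. The sample string is made up, and the "m" flag is left out because it only affects ^ and $, which the pattern doesn't use. It also shows why the backslash has to be doubled when the pattern is passed to RegExp as a string.

```javascript
// The same pattern written two ways: doubled backslashes inside a string,
// single backslashes inside a regex literal.
var fromString = new RegExp("\\b[a-z]{2}\\b", "i");
var fromLiteral = /\b[a-z]{2}\b/i;

var input = "Portland, OR, USA"; // made-up sample input

// Without a capturing group, the whole match (index 0) is the word itself;
// the trailing comma is not part of it.
console.log(fromString.exec(input)[0]);  // "OR"
console.log(fromLiteral.exec(input)[0]); // "OR"

// In a plain string literal, a single "\b" is a backspace character,
// so the regex would never see a word-boundary assertion.
console.log("\b" === "\u0008"); // true
```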
