Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - Extract Arabic phrases from a given text in Java

Can you help me find a regex that takes a list of phrases and checks whether any of them occurs in a given text?

Example:

If the HashSet contains the following phrases:

??? ?????  
??? ???  
??? ????  
?? ?? ??? ???  

And the given text is: ??? ????? ????? ?? ???? ????

After applying the regex, I want to get: ??? ?????

My initial code:

HashSet<String> QWWords = new HashSet<String>();

QWWords.add("??? ?????");
QWWords.add("??? ???");
QWWords.add("??? ????");
QWWords.add("?? ?? ??? ???");

String s1 = "??? ????? ????? ?? ???? ????";

for (String qp : QWWords) {
    Pattern p = Pattern.compile("[\\s" + qp + "\\s]");
    Matcher m = p.matcher(s1);
    String found = "";

    while (m.find()) {
        found = m.group();
        System.out.println(found);
    }
}

1 Reply

[...] is a character class, and a character class matches exactly one character out of the set it lists. For instance, the character class [abc] matches a OR b OR c. So if you want to find the whole word abc, don't surround it with [...].
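To see the difference, here is a minimal sketch (the class name CharacterClassDemo and the helper allMatches are made up for illustration) that collects every match of a regex in a string:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharacterClassDemo {
    // Hypothetical helper: returns the text of every match of regex in input.
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Character class: each match is a single character.
        System.out.println(allMatches("[abc]", "xabcx")); // prints [a, b, c]
        // Plain sequence: the whole word is matched at once.
        System.out.println(allMatches("abc", "xabcx"));   // prints [abc]
    }
}
```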

Another problem is that you are using \s as the word separator, so in the following string

String data = "foo foo foo foo";

the regex \sfoo\s cannot match the first foo because there is no space before it.
So the first match it finds will be

String data = "foo foo foo foo";
//      this one--^^^^^

Now, since the regex consumed the space after the second foo, it can't reuse that space in the next match, so the third foo is also skipped because there is no space left to match before it.
The fourth foo will not match either, because this time there is no space after it.
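This can be checked with a small sketch (the class name WhitespaceDelimiterDemo and the helper matchStarts are illustrative, not part of the question's code) that lists the start index of every match:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceDelimiterDemo {
    // Hypothetical helper: returns the start index of every match of regex in input.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String data = "foo foo foo foo";
        // Only the second foo is found; the match starts at the space at index 3.
        System.out.println(matchStarts("\\sfoo\\s", data)); // prints [3]
    }
}
```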

To solve this problem you can use \b, the word boundary, which matches at a position between an alphanumeric and a non-alphanumeric character (or at the start/end of the string). In a Java string literal it has to be written as "\\b", because a single "\b" is the backspace escape.

So instead of

Pattern p = Pattern.compile("[\\s" + qp + "\\s]");

use

Pattern p = Pattern.compile("\\b" + qp + "\\b");

or, maybe better, as Tim mentioned

Pattern p = Pattern.compile("\\b" + qp + "\\b", Pattern.UNICODE_CHARACTER_CLASS);

to make sure that \b counts Arabic characters as members of the predefined alphanumeric class.

UPDATE:

I am not sure whether your phrases can contain regex metacharacters like { [ + * and so on, so just in case you can also escape them so that such characters are treated as literals.

So

"\\b" + qp + "\\b"

can become

"\\b" + Pattern.quote(qp) + "\\b"
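
Putting the pieces together, a rough sketch could look like the one below. The class name PhraseFinder and the method findPhrases are made up for illustration, and since the original Arabic phrases are not readable here, the sample phrases are hypothetical ones, written as Unicode escapes so that the source does not depend on the file encoding.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

class PhraseFinder {
    // Hypothetical helper: returns the phrases that occur in text as whole words.
    static List<String> findPhrases(Set<String> phrases, String text) {
        List<String> found = new ArrayList<>();
        for (String phrase : phrases) {
            // Pattern.quote turns regex metacharacters in the phrase into literals;
            // UNICODE_CHARACTER_CLASS makes the word boundary Unicode-aware.
            Pattern p = Pattern.compile(
                    "\\b" + Pattern.quote(phrase) + "\\b",
                    Pattern.UNICODE_CHARACTER_CLASS);
            if (p.matcher(text).find()) {
                found.add(phrase);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumed sample data (transliterated: "marhaba bik" and "shukran lak").
        String welcome = "\u0645\u0631\u062D\u0628\u0627 \u0628\u0643";
        String thanks = "\u0634\u0643\u0631\u0627 \u0644\u0643";
        Set<String> phrases = new LinkedHashSet<>(List.of(welcome, thanks));

        // Assumed sample text: the first phrase followed by two more words.
        String text = welcome + " \u064A\u0627 \u0635\u062F\u064A\u0642\u064A";

        // Only the first phrase occurs in the text, so only it is returned.
        List<String> found = findPhrases(phrases, text);
        System.out.println(found.size());
        System.out.println(found);
    }
}
```

Compiling one pattern per phrase keeps the sketch close to the loop in the question; for a large set it may be worth joining the quoted phrases with | into a single pattern instead.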
