Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - Splitting strings by punctuation and whitespace with regular expressions in Java

I have a text file that I read into a Java application, and I count the words in it line by line. Right now I am splitting each line into words with

String.split("[\\p{Punct}\\s+]")

But I know I am missing out on some words from the text file. For example, the word "can't" should be divided into two words "can" and "t".

Commas and other punctuation should be completely ignored and considered as whitespace. I have been trying to understand how to form a more precise Regular Expression to do this but I am a novice when it comes to this so I need some help.

What could be a better regex for the purpose I have described?



1 Reply


You have one small mistake in your regex. Try this:

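String[] Res = Text.split("[\\p{Punct}\\s]+");

The regex is [\p{Punct}\s]+ (with the backslashes doubled inside a Java string literal): move the + from inside the character class to the outside. Inside the class, the + is only a literal plus sign, which \p{Punct} already covers, so it does not act as a quantifier. Consecutive separator characters are then not combined into one split point, and you get empty strings between them.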
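
To make the difference concrete, here is a minimal sketch that runs both patterns on the same short input. The class and method names (SplitPlusDemo, splitPlusInside, splitPlusOutside) are only illustrative and are not part of the original question or answer.

```java
import java.util.Arrays;

public class SplitPlusDemo {

    // '+' inside the character class is a literal plus sign,
    // so each single separator character is its own split point.
    static String[] splitPlusInside(String text) {
        return text.split("[\\p{Punct}\\s+]");
    }

    // '+' outside the class is a quantifier,
    // so a run of separator characters is one split point.
    static String[] splitPlusOutside(String text) {
        return text.split("[\\p{Punct}\\s]+");
    }

    public static void main(String[] args) {
        String sample = "know. For";
        // The '.' and the space are split separately, leaving an empty token between them.
        System.out.println(Arrays.toString(splitPlusInside(sample)));  // [know, , For]
        // The '.' and the space are consumed together.
        System.out.println(Arrays.toString(splitPlusOutside(sample))); // [know, For]
    }
}
```

The empty token in the first result is what inflates or distorts a word count when the quantifier sits inside the class.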

For this code

String Text = "But I know. For example, the word \"can't\" should";

// Split on runs of punctuation and/or whitespace
String[] Res = Text.split("[\\p{Punct}\\s]+");
System.out.println(Res.length);
for (String s : Res) {
    System.out.println(s);
}

I get this result:

10
But
I
know
For
example
the
word
can
t
should

Which should meet your requirement.
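
One caveat if you count words per line: when a line begins with a separator (for example an opening quote), split returns an empty first element, which would be counted as a word. A possible sketch of a per-line counter that skips empty tokens follows; the class name WordCounter and the method countWords are hypothetical and only illustrate the idea.

```java
import java.util.regex.Pattern;

public class WordCounter {

    // Compile once and reuse for every line of the file.
    private static final Pattern SEPARATORS = Pattern.compile("[\\p{Punct}\\s]+");

    // Counts the non-empty tokens in a single line.
    static int countWords(String line) {
        int count = 0;
        for (String token : SEPARATORS.split(line)) {
            // A line that starts with a separator yields an empty first token; skip it.
            if (!token.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("But I know. For example, the word \"can't\" should")); // 10
        System.out.println(countWords("\"Quoted\" start")); // 2
        System.out.println(countWords("")); // 0
    }
}
```

Summing countWords over all lines of the file then gives the total word count.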

As an alternative you can use

String[] Res = Text.split("\\P{L}+");

\P{L} matches any Unicode code point that does not have the property "Letter", so every run of non-letters (digits included) acts as a separator.
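
The two patterns do not behave identically, so it may help to compare them on one input. The sketch below assumes an input containing digits and a typographic apostrophe (U+2019); by default \p{Punct} in Java covers only ASCII punctuation, while \P{L} treats anything that is not a letter as a separator. The class and method names are illustrative only.

```java
import java.util.Arrays;

public class LetterSplitDemo {

    // Separators: ASCII punctuation and whitespace.
    static String[] splitOnPunctAndSpace(String text) {
        return text.split("[\\p{Punct}\\s]+");
    }

    // Separators: anything that is not a Unicode letter.
    static String[] splitOnNonLetters(String text) {
        return text.split("\\P{L}+");
    }

    public static void main(String[] args) {
        // \u2019 is the typographic (curly) apostrophe.
        String sample = "route 66, can\u2019t";

        // Tokens: route, 66, can\u2019t
        // The digits survive and the curly apostrophe is not a separator.
        System.out.println(Arrays.toString(splitOnPunctAndSpace(sample)));

        // Tokens: route, can, t
        // The digits are dropped and the curly apostrophe splits the word.
        System.out.println(Arrays.toString(splitOnNonLetters(sample)));
    }
}
```

Which one fits depends on whether numbers should count as words in your text.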

