Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - Split string into sentences in JavaScript

Currently I am working on an application that splits a long column of text into shorter ones. To do that, I split the entire text into sentences, but at the moment my regex also splits inside numbers.

This is what I do:

const str = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
// Append a pipe after every run of periods, every ? and every !, then split on the pipes
const sentences = str.replace(/\.+/g,'.|').replace(/\?/g,'?|').replace(/!/g,'!|').split("|");

The result is:

Array [
    "This is a long string with some numbers [125.",
    "000,55 and 140.",
    "000] and an end.",
    " This is another sentence."
]

The desired result would be:

Array [
    "This is a long string with some numbers [125.000,55 and 140.000] and an end.",
    "This is another sentence."
]

How do I have to change my regex to achieve this? Are there any problems I should watch out for? Or would it be good enough to search for ". ", "? " and "! "?

1 Reply

str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|")

Output:

[ 'This is a long string with some numbers [125.000,55 and 140.000] and an end.',
  'This is another sentence.' ]

Breakdown:

([.?!]) = Capture either . or ? or !

\s* = Match 0 or more whitespace characters following the previous token ([.?!]). This accounts for the spaces that normally follow a punctuation mark in English. Because this whitespace is matched outside the capturing group, the replacement drops it.

(?=[A-Z]) = A lookahead: the previous tokens only match if the next character is within the range A-Z (capital A to capital Z). The lookahead checks that character without consuming it. Most English sentences start with a capital letter. The regex in the question does not take this into account.
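
To see what the lookahead contributes, here is a small sketch that runs the pattern with and without (?=[A-Z]) on the string from the question. The variable names are made up for this illustration.

```javascript
const text = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";

// Without the lookahead, every ., ? or ! becomes a split point,
// including the periods inside the numbers and the final period.
const withoutLookahead = text.replace(/([.?!])\s*/g, "$1|").split("|");

// With the lookahead, a split point is only created when a capital letter follows.
const withLookahead = text.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|");

console.log(withoutLookahead.length); // 5 (the last piece is an empty string)
console.log(withLookahead.length);    // 2
```

The periods in 125.000 and 140.000 are followed by a digit, so the lookahead rejects them.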


The replace operation uses:

"$1|"

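We used one capturing group, ([.?!]), which captures one of those characters. In the replacement string, $1 stands for the captured character, so the whole match (the punctuation mark plus any whitespace after it) is replaced with that character followed by |. So if we captured ? then the replacement would be ?|.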

Finally, we split on the pipes | and get our result.
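
A short sketch of the marker step, plus one thing to watch out for: a | that is already in the text is treated as a split point too. The last statement shows one possible alternative, splitting directly with a lookbehind. That part is an assumption on my side and not part of the answer's approach; it needs a runtime that supports ES2018 lookbehind.

```javascript
const pattern = /([.?!])\s*(?=[A-Z])/g;

// "$1" puts the captured punctuation mark back, followed by the pipe marker.
const marked = "Really? Yes! Done.".replace(pattern, "$1|");
console.log(marked); // "Really?|Yes!|Done."

// Caveat: a pipe that is part of the text also splits the sentence.
const piped = "Use a|b here. Next one.".replace(pattern, "$1|").split("|");
console.log(piped); // [ 'Use a', 'b here.', 'Next one.' ]

// Alternative sketch (assumes lookbehind support): split without a marker character.
const direct = "Use a|b here. Next one.".split(/(?<=[.?!])\s*(?=[A-Z])/);
console.log(direct); // [ 'Use a|b here.', 'Next one.' ]
```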


So, essentially, what we are saying is this:

1) Find punctuation marks (one of . or ? or !) and capture them.

2) Optionally match any whitespace that follows the punctuation mark.

3) Require that the next character is a capital letter.

Unlike the regex in the question, this follows the usual convention for English sentence boundaries. It is a heuristic, not a full model of English grammar.

From there:

4) We replace each match with the captured punctuation mark followed by a pipe |.

5) We split on the pipes to create an array of sentences.
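
The five steps can be wrapped in a small function. This is a minimal sketch; splitSentences is a hypothetical name chosen here, and the second call shows a case where the capital-letter rule is not enough.

```javascript
// Hypothetical helper wrapping the steps above.
function splitSentences(text) {
  return text
    .replace(/([.?!])\s*(?=[A-Z])/g, "$1|") // steps 1-4: mark each sentence boundary
    .split("|");                             // step 5: cut at the markers
}

const result = splitSentences("Is it 3.14? Yes! It is.");
console.log(result); // [ 'Is it 3.14?', 'Yes!', 'It is.' ]

// Limitation: an abbreviation followed by a capitalized word is split as well.
const abbreviated = splitSentences("Mr. Smith arrived. He sat down.");
console.log(abbreviated); // [ 'Mr.', 'Smith arrived.', 'He sat down.' ]
```

If the text contains titles or other abbreviations, it would need extra handling on top of this pattern.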

