Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - JavaScript Unicode string: Chinese characters but no punctuation

I am trying to filter a Unicode string using JavaScript. The string could contain mixed characters. Example: 我的中文不好。我是意大利人。你知道吗?

Ultimately, the string may contain: Chinese characters, Chinese punctuation, and ANSI characters and punctuation.

I need to keep only the Chinese characters. Any hints?

1 Reply

You can see the relevant blocks at http://www.unicode.org/reports/tr38/#BlockListing or http://www.unicode.org/charts/ .

If you are excluding compatibility characters (ones which should no longer be used), as well as strokes, radicals, and Enclosed CJK Letters and Months, the following ought to cover it (I've added the individual JavaScript equivalent expressions afterward):

  • CJK Unified Ideographs (4E00-9FCC) [\u4E00-\u9FCC]
  • CJK Unified Ideographs Extension A (3400-4DB5) [\u3400-\u4DB5]
  • CJK Unified Ideographs Extension B (20000-2A6D6) [\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6]
  • CJK Unified Ideographs Extension C (2A700-2B734) \ud869[\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34]
  • CJK Unified Ideographs Extension D (2B740-2B81D) \ud86d[\udf40-\udfff]|\ud86e[\udc00-\udc1d]
  • 12 characters within the CJK Compatibility Ideographs block (F900-FA6D/FA70-FAD9) which are actually CJK unified ideographs [\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]

...so, a regex to grab the Chinese characters would be:

/[\u4E00-\u9FCC\u3400-\u4DB5\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]|[\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34\udf40-\udfff]|\ud86e[\udc00-\udc1d]/
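As a minimal sketch of how this expression might be applied to the string from the question, the snippet below adds the global flag and joins all matches. The helper name keepChinese is hypothetical and only used for illustration:

```javascript
// Sketch: keep only Chinese characters by collecting every match of the
// surrogate-pair-based expression (no ES6 "u" flag required).
const CHINESE_CHARS = /[\u4E00-\u9FCC\u3400-\u4DB5\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]|[\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34\udf40-\udfff]|\ud86e[\udc00-\udc1d]/g;

// keepChinese is a hypothetical helper name, not part of the original answer.
function keepChinese(text) {
  const matches = text.match(CHINESE_CHARS); // null when nothing matches
  return matches ? matches.join("") : "";
}

console.log(keepChinese("我的中文不好。我是意大利人。你知道吗?"));
// → 我的中文不好我是意大利人你知道吗
```

Both the Chinese full stop (。) and the ASCII question mark fall outside the listed ranges, so they are dropped.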

Because there are so many CJK (Chinese-Japanese-Korean) characters, Unicode was expanded beyond the "Basic Multilingual Plane" to handle more characters (called "astral" characters). The CJK Unified Ideographs Extensions B-D are examples of such astral characters, so their ranges are more complicated: they have to be encoded using surrogate pairs in UTF-16 systems like JavaScript. A surrogate pair consists of a high surrogate and a low surrogate, neither of which is valid by itself, but which together form a single character (despite having a string length of 2).
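To make the surrogate arithmetic concrete, here is a small sketch that derives the pair for a given astral code point using the standard UTF-16 formula. The function name toSurrogatePair is hypothetical:

```javascript
// Sketch: split an astral code point (above U+FFFF) into its UTF-16 surrogate pair.
// toSurrogatePair is a hypothetical helper name used only for illustration.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not an astral code point");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits select the high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits select the low surrogate
  return [high, low];
}

const hex = (n) => n.toString(16);
console.log(toSurrogatePair(0x20000).map(hex).join(" ")); // → d840 dc00
console.log(toSurrogatePair(0x2A6D6).map(hex).join(" ")); // → d869 ded6
console.log(String.fromCodePoint(0x20000).length);        // → 2
```

The two results are the first and last code points of Extension B, which is where the bounds \ud840\udc00 and \ud869\uded6 in the expression come from.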

While it would probably be easier for replacement purposes to express this as the non-Chinese characters (to replace them with the empty string), I provided the expression for the Chinese characters instead so that it would be easier to track in case you needed to add or remove from the blocks.

Update September 2017

As of ES6, one may express the regular expressions without resorting to surrogates by using the "u" flag along with the code point inside of the new escape sequence with brackets, e.g., /^[\u{20000}-\u{2A6D6}]*$/u for "CJK Unified Ideographs Extension B".

Note that Unicode too has progressed to include "CJK Unified Ideographs Extension E" ([\u{2B820}-\u{2CEAF}]) and "CJK Unified Ideographs Extension F" ([\u{2CEB0}-\u{2EBEF}]).
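With the "u" flag, the negated form mentioned earlier (replacing everything that is not a Chinese character with the empty string) becomes practical, since astral ranges can sit inside a single character class. The following is a sketch under that assumption; the name stripNonChinese is hypothetical, and the ranges are simply the blocks listed in this answer plus Extensions E and F:

```javascript
// Sketch: remove every code point that is NOT in one of the listed blocks.
// The "u" flag makes the class operate on code points rather than UTF-16 units.
const NON_CHINESE = new RegExp(
  "[^" +
    "\\u4E00-\\u9FCC" +           // CJK Unified Ideographs
    "\\u3400-\\u4DB5" +           // Extension A
    "\\uFA0E\\uFA0F\\uFA11\\uFA13\\uFA14\\uFA1F\\uFA21\\uFA23\\uFA24\\uFA27-\\uFA29" +
    "\\u{20000}-\\u{2A6D6}" +     // Extension B
    "\\u{2A700}-\\u{2B734}" +     // Extension C
    "\\u{2B740}-\\u{2B81D}" +     // Extension D
    "\\u{2B820}-\\u{2CEAF}" +     // Extension E
    "\\u{2CEB0}-\\u{2EBEF}" +     // Extension F
  "]",
  "gu"
);

// stripNonChinese is a hypothetical helper name.
function stripNonChinese(text) {
  return text.replace(NON_CHINESE, "");
}

console.log(stripNonChinese("我的中文不好。我是意大利人。你知道吗?"));
// → 我的中文不好我是意大利人你知道吗
```

Building the pattern from commented string pieces keeps each block easy to add or remove, at the cost of doubling the backslashes.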

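For ES2018, it appeared that Unicode property escapes would be able to simplify things even further. Per http://2ality.com/2017/07/regexp-unicode-property-escapes.html , it looked like one would be able to write the following. Note, however, that the specification as finalized only accepts General_Category, Script, Script_Extensions, and the binary properties, so JavaScript engines reject Block-based escapes like these; they remain a useful description of the blocks for regex flavors that do support the Block property:

/^(\p{Block=CJK Unified Ideographs}|\p{Block=CJK Unified Ideographs Extension A}|\p{Block=CJK Unified Ideographs Extension B}|\p{Block=CJK Unified Ideographs Extension C}|\p{Block=CJK Unified Ideographs Extension D}|\p{Block=CJK Unified Ideographs Extension E}|\p{Block=CJK Unified Ideographs Extension F}|[\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29])+$/u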

And as the shorter aliases from http://unicode.org/Public/UNIDATA/PropertyAliases.txt and http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt can also be used for these blocks, you could shorten this to the following (Unicode's loose matching rules also tolerate changing underscores to spaces or changing the casing, though individual regex engines may be stricter):

/^(\p{Blk=CJK}|\p{Blk=CJK_Ext_A}|\p{Blk=CJK_Ext_B}|\p{Blk=CJK_Ext_C}|\p{Blk=CJK_Ext_D}|\p{Blk=CJK_Ext_E}|\p{Blk=CJK_Ext_F}|[\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29])+$/u

And if we wanted to improve readability, we could document the falsely labeled compatibility characters using named capture groups (see http://2ality.com/2017/05/regexp-named-capture-groups.html ):

/^(\p{Blk=CJK}|\p{Blk=CJK_Ext_A}|\p{Blk=CJK_Ext_B}|\p{Blk=CJK_Ext_C}|\p{Blk=CJK_Ext_D}|\p{Blk=CJK_Ext_E}|\p{Blk=CJK_Ext_F}|(?<CJKFalseCompatibilityUnifieds>[\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]))+$/u

And since, per http://unicode.org/reports/tr44/#Unified_Ideograph , the "Unified_Ideograph" property (alias "UIdeo") appears to cover all of our unified ideographs while excluding symbols/punctuation and compatibility characters, the following may be all you need if you don't have to pick and choose out of the above. In JavaScript, binary properties are written without a value, so the "=yes" (or "=y") form used in some other regex flavors is omitted:

/^\p{Unified_Ideograph}*$/u

or in shorthand:

/^\p{UIdeo}*$/u
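For the original task of keeping only the Chinese characters, the complement escape \P{...} can be used in a replace. This is a minimal sketch assuming an engine that supports ES2018 Unicode property escapes, with the binary property written without a value (JavaScript's required form); the helper name keepUnifiedIdeographs is hypothetical:

```javascript
// Sketch: drop every code point that does not have the Unified_Ideograph
// property. \P{...} (capital P) is the negation of \p{...}.
// keepUnifiedIdeographs is a hypothetical helper name.
function keepUnifiedIdeographs(text) {
  return text.replace(/\P{UIdeo}/gu, "");
}

console.log(keepUnifiedIdeographs("我的中文不好。我是意大利人。你知道吗?"));
// → 我的中文不好我是意大利人你知道吗

// Whole-string check, as in the anchored expressions above.
console.log(/^\p{UIdeo}*$/u.test("中文"));   // → true
console.log(/^\p{UIdeo}*$/u.test("中文。")); // → false
```

Because the property data comes from the engine's Unicode version, newer extensions are picked up without editing the pattern.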

