Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


php - Remove everything except the Chinese characters with regex?

I have a string containing a sentence written in Chinese.

It contains Chinese characters plus filler such as spaces, commas, and exclamation marks, all encoded in UTF-8.

With a Latin-1 string, I could use preg_replace with [a-zA-Z] to clean it up and remove the filler.

How can I keep only the Chinese "alphabet" characters in the string and remove all the filler?



1 Reply


According to this document, these are the Unicode ranges of Chinese characters:

Table 12-2. Blocks Containing Han Ideographs

Block                                     Range         Comment
CJK Unified Ideographs                    4E00–9FFF     Common
CJK Unified Ideographs Extension A        3400–4DBF     Rare
CJK Unified Ideographs Extension B        20000–2A6DF   Rare, historic
CJK Unified Ideographs Extension C        2A700–2B73F   Rare, historic
CJK Unified Ideographs Extension D        2B740–2B81F   Uncommon, some in current use
CJK Compatibility Ideographs              F900–FAFF     Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement   2F800–2FA1F   Unifiable variants

You could use it like this:

preg_replace('/[^\x{4E00}-\x{9FFF}]+/u', '', $string);

or

preg_replace('/\P{Han}+/u', '', $string);

where \P is the negation of \p
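
As a quick check of the \P{Han} form, here is a minimal sketch. The sample sentence in $string is made up for illustration and uses only ASCII filler (spaces, a comma, an exclamation mark, Latin letters, digits). Which characters count as Han depends on the Unicode tables of the PCRE library your PHP build uses, so treat this as an illustration rather than a guarantee for every punctuation mark.

```php
<?php
// Hypothetical sample: Chinese text mixed with ASCII filler.
$string = '你好, 世界! Hello 123';

// \P{Han} matches any run of characters that are NOT in the Han script.
// The /u modifier makes PCRE treat the pattern and the subject as UTF-8.
$onlyHan = preg_replace('/\P{Han}+/u', '', $string);

echo $onlyHan, PHP_EOL; // prints "你好世界"
```

Note that preg_replace returns the cleaned string instead of modifying $string in place, so the result has to be assigned.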

See here for all the Unicode scripts.
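
The explicit-range pattern above covers only the common 4E00–9FFF block. If you prefer explicit ranges but also want the rarer blocks from Table 12-2, one possible sketch is to put every range from the table into a single character class. The helper name keepHanBlocks and the sample text are assumptions made for this example, not part of PHP or of the original answer.

```php
<?php
// Character class built from the blocks listed in Table 12-2.
// Single quotes keep the \x{...} escapes intact for PCRE.
$hanBlocks = '\x{4E00}-\x{9FFF}'      // CJK Unified Ideographs
           . '\x{3400}-\x{4DBF}'      // Extension A
           . '\x{20000}-\x{2A6DF}'    // Extension B
           . '\x{2A700}-\x{2B73F}'    // Extension C
           . '\x{2B740}-\x{2B81F}'    // Extension D
           . '\x{F900}-\x{FAFF}'      // CJK Compatibility Ideographs
           . '\x{2F800}-\x{2FA1F}';   // CJK Compatibility Ideographs Supplement

// Hypothetical helper: removes every character outside the given class.
function keepHanBlocks(string $text, string $class): string
{
    // The /u modifier is required for code points above 0xFF.
    $result = preg_replace('/[^' . $class . ']+/u', '', $text);
    if ($result === null) {
        // preg_replace returns null on error, e.g. invalid UTF-8 input.
        throw new InvalidArgumentException('Input could not be processed as UTF-8.');
    }
    return $result;
}

// U+20000 is in Extension B, outside 4E00-9FFF, so the single-range
// pattern would strip it while the combined class keeps it.
$sample = "你好, \u{20000} world!";
echo keepHanBlocks($sample, $hanBlocks), PHP_EOL;
```

The \P{Han} version is shorter and does not need updating when new blocks are added, so the explicit list is mainly useful when you want to control exactly which blocks are kept.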

