Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
601 views
in Technique[技术] by (71.8m points)

regex - Declaration to make PHP script completely Unicode-friendly

Remembering to do all the stuff you need to do in PHP to get it to work properly with Unicode is far too tricky, tedious, and error-prone, so I'm looking for the trick to get PHP to magically upgrade absolutely everything it possibly can from musty old ASCII byte mode into modern Unicode character mode, all at once and by using just one simple declaration.

The idea is to modernize PHP scripts to work with Unicode without having to clutter up the source code with a bunch of confusing alternate function calls and special regexes. Everything should just “Do The Right Thing” with Unicode, no questions asked.

Given that the goal is maximum Unicodeness with minimal fuss, this declaration must at least do these things (plus anything else I’ve forgotten that furthers the overall goal):

  • The PHP script source is itself in considered to be in UTF?8 (eg, strings and regexes).

  • All input and output is automatically converted to/from UTF?8 as needed, and with a normalization option (eg, all input normalized to NFD and all output normalized to NFC).

  • All functions with Unicode versions use those instead (eg, Collator::sort for sort).

  • All byte functions (eg, strlen, strstr, strpos, and substr) work like the corresponding character functions (eg, mb_strlen, mb_strstr, mb_strpos, and mb_substr).

  • All regexes and regexy functions transparently work on Unicode (ie, like all the preggers have /u tacked on implicitly, and things like w and and s all work on Unicode the way The Unicode Standard requires them to work, etc).

For extra credit :), I'd like there to be a way to “upgrade” this declaration to full grapheme mode. That way the byte or character functions become grapheme functions (eg, grapheme_strlen, grapheme_strstr, grapheme_strpos, and grapheme_substr), and the regex stuff works on proper graphemes (ie, . — or even [^abc] — matches a Unicode grapheme cluster no matter how many code points it contains, etc).

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

That full-unicode thing was precisely the idea of PHP 6 -- which has been canceled more than one year ago.

So, no, there is no way of getting all that -- except by using the right functions, and remembering that characters are not the same as bytes.


One thing that might help with you fourth point, though, is the Function Overloading Feature of the mbstring extension (quoting) :

mbstring supports a 'function overloading' feature which enables you to add multibyte awareness to such an application without code modification by overloading multibyte counterparts on the standard string functions.
For example, mb_substr() is called instead of substr() if function overloading is enabled.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...