Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


utf-8 - Is PHP str_word_count() multibyte safe?

I want to use str_word_count() on a UTF-8 string.

Is this safe in PHP? It seems to me that it should be (especially considering that there is no mb_str_word_count()).

But on php.net there are a lot of people muddying the water by presenting their own 'multibyte compatible' versions of the function.

So I guess I want to know...

  1. Given that str_word_count simply counts all character sequences delimited by " " (space), it should be safe on multibyte strings, even though it's not necessarily aware of the character sequences, right?

  2. Are there any equivalent 'space' characters in UTF-8 which are not ASCII " " (space)?

This is where the problem might lie I guess.



1 Reply


I'd say you guess right. And indeed there are space characters in UTF-8 which are not part of US-ASCII. To give you an example of such spaces:

  - Unicode character 'NO-BREAK SPACE' (U+00A0): 2 bytes in UTF-8: 0xC2 0xA0

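And perhaps as well:

  - Unicode character 'NEXT LINE (NEL)' (U+0085): 2 bytes in UTF-8: 0xC2 0x85
  - Unicode character 'LINE SEPARATOR' (U+2028): 3 bytes in UTF-8: 0xE2 0x80 0xA8
  - Unicode character 'PARAGRAPH SEPARATOR' (U+2029): 3 bytes in UTF-8: 0xE2 0x80 0xA9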

Anyway, the first one, the 'NO-BREAK SPACE' (U+00A0), is a good example as it is also part of the Latin-X charsets. And the PHP manual already hints that str_word_count is locale dependent.
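To make the byte-level picture concrete, here is a small sketch that prints how a few space-like code points are encoded in UTF-8. The selection of code points and the `$spaces` array are my own choice for illustration, and the `"\u{...}"` escape syntax assumes PHP 7 or later:

```php
<?php
// "\u{...}" inside a double-quoted string yields the UTF-8 bytes of that code point (PHP 7+).
// The list below is an illustrative selection, not an exhaustive one.
$spaces = [
    'SPACE'               => "\u{0020}",
    'NO-BREAK SPACE'      => "\u{00A0}",
    'LINE SEPARATOR'      => "\u{2028}",
    'PARAGRAPH SEPARATOR' => "\u{2029}",
];

foreach ($spaces as $name => $char) {
    // strlen() counts bytes, bin2hex() shows them in hexadecimal.
    printf("%-20s %d byte(s): %s\n", $name, strlen($char), bin2hex($char));
}
```

Only the plain ASCII space is a single byte; the others are two or three bytes long, so a function that looks at one byte at a time cannot recognise them as single characters.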

If we want to put this to a test, we can set the locale to a UTF-8 one and pass in an invalid string containing a lone \xA0 byte. If that byte still counts as a word-breaking character, the function is clearly not UTF-8 safe, and hence not multibyte safe (a term that is just as undefined here as it is in the question):

<?php
/**
 * is PHP str_word_count() multibyte safe?
 * @link https://stackoverflow.com/q/8290537/367456
 */

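echo 'New Locale: ', setlocale(LC_ALL, 'en_US.utf8'), "\n\n";

// A lone \xA0 byte is not valid UTF-8 (NO-BREAK SPACE is \xC2\xA0 in UTF-8).
$test   = "aword\xA0bword aword";
// Format 2 returns an array keyed by each word's byte offset in the string.
$result = str_word_count($test, 2);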

var_dump($result);

Output:

New Locale: en_US.utf8

array(3) {
  [0]=>
  string(5) "aword"
  [6]=>
  string(5) "bword"
  [12]=>
  string(5) "aword"
}

As this demo shows, the function totally fails on the locale promise it gives on the manual page, which I exploit here to demonstrate that it does nothing whatsoever with regard to the UTF-8 character encoding. (I neither wonder nor moan about this: most often, if you read that a function is locale specific in PHP, run for your life and find one that is not.)
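If you want to double-check that the test string really is invalid UTF-8, one way is to let PCRE validate it. This is only a sketch: `is_valid_utf8()` is a hypothetical helper name of mine, and it relies on preg_match() returning false when a subject is not well-formed UTF-8 under the `u` modifier:

```php
<?php
// Hypothetical helper: true when $s is well-formed UTF-8.
function is_valid_utf8(string $s): bool
{
    // With the /u modifier PCRE validates the whole subject as UTF-8 first;
    // preg_match() returns false (not 0) when that validation fails.
    return preg_match('//u', $s) === 1;
}

var_dump(is_valid_utf8("aword\xA0bword aword"));     // lone 0xA0 byte
var_dump(is_valid_utf8("aword\u{00A0}bword aword")); // real NO-BREAK SPACE (0xC2 0xA0)
```

The first call reports false and the second true, so str_word_count() happily split a string that a UTF-8 aware function would have rejected.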

Instead, for UTF-8 you should take a look at the PCRE extension.

PCRE has a good understanding of Unicode, and of UTF-8 in PHP specifically. It can also be quite fast if you craft the regular expression pattern carefully.
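As a rough illustration of the PCRE route, a word counter could look like the sketch below. The function name `utf8_word_count()` and the definition of a "word" (a Unicode letter followed by letters, combining marks, apostrophes or hyphens) are assumptions of mine, loosely modelled on what the manual describes for str_word_count(); adjust the pattern to your own needs:

```php
<?php
// Hypothetical helper: counts words in a UTF-8 string using Unicode properties.
function utf8_word_count(string $text): int
{
    // \p{L} = any Unicode letter, \p{M} = combining mark.
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    $count = preg_match_all('/\p{L}[\p{L}\p{M}\'-]*/u', $text);

    if ($count === false) {
        // preg_match_all() returns false on error, e.g. when the subject is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $count;
}

echo utf8_word_count("aword\u{00A0}bword aword"), "\n"; // 3: NO-BREAK SPACE separates words
echo utf8_word_count("na\u{00EF}ve caf\u{00E9}"), "\n"; // 2: accented letters stay inside words
```

Unlike the byte-oriented test above, an invalid sequence such as a lone \xA0 raises an exception here instead of being silently treated as a separator.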

