Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


normalization - Why doesn't Normalizer::normalize (PHP) work?

I'm trying to normalize strings with characters like 'áéíóú' to 'aeiou' to simplify searches.

According to the answer to this question, I should use the Normalizer class to do it.

The problem is that the normalize function appears to do nothing. For example, this code:

<?php echo 'Pérez, NFC: ' . normalizer_normalize('Pérez', Normalizer::NFC)
    . ' NFD: ' . normalizer_normalize('Pérez', Normalizer::NFD)
    . ' NFKC: ' . normalizer_normalize('Pérez', Normalizer::NFKC)
    . ' NFKD: ' . normalizer_normalize('Pérez', Normalizer::NFKD); ?>
<br/>
<?php echo 'aáà?, ê?éè,'
    . ' FORM_C: ' . normalizer_normalize('aáà?, ê?éè', Normalizer::FORM_C)
    . ' FORM_D: ' . normalizer_normalize('aáà?, ê?éè', Normalizer::FORM_D)
    . ' FORM_KC: ' . normalizer_normalize('aáà?, ê?éè', Normalizer::FORM_KC)
    . ' FORM_KD: ' . normalizer_normalize('aáà?, ê?éè', Normalizer::FORM_KD); ?>

shows:

Pérez, NFC: Pérez NFD: Pe?rez NFKC: Pérez NFKD: Pe?rez
aáà?, ê?éè, FORM_C: aáà?, ê?éè FORM_D: aa?a?a?, e?e?e?e? FORM_KC: aáà?, ê?éè FORM_KD: aa?a?a?, e?e?e?e? 

What is normalize supposed to do?

---EDITED---

It gets stranger. When I copy and paste the result from the web browser, the editor and the original page show:

FORM_D: aáà?, ê?éè

but on the Stack Overflow question page (only in Code Sample mode) I see:

FORM_D: aa?a?a?, e?e?e?e?


1 Reply


Found on this page (the linked document now has different wording; the old version no longer exists):

Unicode and internationalization is a large topic, but you should know at least one more important thing. For historical reasons, Unicode allows alternative representations of some characters. For example, á can be written either as one precomposed character á with the Unicode code point U+00E1 or as a decomposed sequence of the letter a (U+0061) combined with the combining acute accent (U+0301). For purposes of comparison and sorting, two such representations should be taken as equal. To solve this, the intl library provides the Normalizer class. This class in turn provides the normalize() method, which you can use to convert a string to a normalized composed or decomposed form. Your application should consistently transform all strings to one or the other form before performing comparisons.

echo Normalizer::normalize("a\u{0301}", Normalizer::FORM_C); // á as one code point (U+00E1)
echo Normalizer::normalize("\u{00E1}", Normalizer::FORM_D); // a followed by U+0301, still rendered as á
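
The composed and decomposed forms usually render identically, which is probably why the output in the question looks as if nothing happened. A minimal sketch, assuming PHP 7 or later with the intl extension loaded, that makes the difference visible by dumping the bytes (the variable names are only illustrative):

```php
<?php
// Assumes the intl extension is loaded; it provides the Normalizer class.
$composed   = "P\u{00E9}rez";                                        // é as one code point, U+00E9
$decomposed = Normalizer::normalize($composed, Normalizer::FORM_D);  // e followed by U+0301

echo strlen($composed), ' ', bin2hex($composed), PHP_EOL;      // 6 50c3a972657a
echo strlen($decomposed), ' ', bin2hex($decomposed), PHP_EOL;  // 7 5065cc8172657a

// The raw strings differ; they match once both are in the same form.
var_dump($composed === $decomposed);                                             // bool(false)
var_dump($composed === Normalizer::normalize($decomposed, Normalizer::FORM_C));  // bool(true)
```

The "Pe?rez" in the question's output is consistent with this: the "?" is most likely the combining accent, shown by a renderer that cannot attach it to the preceding letter.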

So eliminating accents (and similar) is not the purpose of Normalizer.
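
For the original goal (matching 'áéíóú' against 'aeiou' in searches), one common approach, which is not taken from the quoted page, is to decompose the string first and then delete the combining marks that the decomposition leaves behind. A minimal sketch, assuming the intl extension and PCRE with Unicode support; strip_accents is a hypothetical helper name, not part of intl:

```php
<?php
// Hypothetical helper, not part of intl: removes diacritics by decomposing
// the string and then dropping the combining marks left behind.
function strip_accents(string $text): string
{
    // FORM_D splits "é" into "e" + U+0301 (combining acute accent).
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text; // normalization failed (e.g. invalid UTF-8); leave the input untouched
    }
    // \p{Mn} matches nonspacing marks such as U+0301; the u flag enables UTF-8 mode.
    return preg_replace('/\p{Mn}/u', '', $decomposed) ?? $text;
}

echo strip_accents("P\u{00E9}rez"), PHP_EOL;                              // Perez
echo strip_accents("\u{00E1}\u{00E9}\u{00ED}\u{00F3}\u{00FA}"), PHP_EOL;  // aeiou
```

This only handles characters that have a canonical decomposition. Letters such as ø or ł are single code points with no combining mark to remove, so they pass through unchanged and would need a separate mapping.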

