Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
981 views
in Technique[技术] by (71.8m points)

arrays - Natural sorting algorithm in PHP with support for Unicode?

Is it possible to sort an array with Unicode / UTF-8 characters in PHP using a natural order algorithm? For example (this array is already in the correct order):

$array = array
(
    0 => 'Agile',
    1 => 'ágile',
    2 => 'àgile',
    3 => 'âgile',
    4 => 'ägile',
    5 => 'ãgile',
    6 => 'Test',
);

If I try with asort($array) I get the following result:

Array
(
    [0] => Agile
    [6] => Test
    [2] => àgile
    [1] => ágile
    [3] => âgile
    [5] => ãgile
    [4] => ägile
)

And using natsort($array):

Array
(
    [2] => àgile
    [1] => ágile
    [3] => âgile
    [5] => ãgile
    [4] => ägile
    [0] => Agile
    [6] => Test
)

How can I implement a function that returns the correct order (0, 1, 2, 3, 4, 5, 6) under PHP 5? All the multibyte string functions (mbstring, iconv, ...) are available on my system.

EDIT: I want to natsort() the values, not the keys. The only reason I'm explicitly defining the keys (and using asort() instead of sort()) is to make it easier to see where the sorting of Unicode values went wrong.



1 Reply

0 votes
by (71.8m points)

The question is not as easy to answer as it seems at first glance. This is one of the areas where PHP's lack of Unicode support hits you with full force.

First of all, natsort(), as suggested by other posters, has nothing to do with sorting arrays of the type you want to sort: it only changes how numbers embedded in strings are compared. What you're looking for is a locale-aware sorting mechanism, because the order of strings with extended characters always depends on the language in use. Take German, for example: A and Ä can sometimes be sorted as if they were the same letter (DIN 5007/1), and sometimes Ä is sorted as if it were in fact "AE" (DIN 5007/2). In Swedish, in contrast, Ä sorts near the end of the alphabet, after Z.
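
To make the language dependence concrete, the two German conventions can be approximated by rewriting the umlauts before comparing. This is only a rough sketch, not a full implementation of DIN 5007: the function names and the sample words are made up for illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
// Variant 1 (dictionary style): umlauts sort like their base letters.
function din5007Variant1Key($word)
{
    return strtolower(strtr($word, array(
        'Ä' => 'A', 'Ö' => 'O', 'Ü' => 'U',
        'ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ß' => 'ss',
    )));
}

// Variant 2 (phone-book style): umlauts sort as base letter plus "e".
function din5007Variant2Key($word)
{
    return strtolower(strtr($word, array(
        'Ä' => 'AE', 'Ö' => 'OE', 'Ü' => 'UE',
        'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss',
    )));
}

// Sorts $words by the comparison keys that $keyFunction produces for them.
function sortByKey(array $words, $keyFunction)
{
    $keys = array_map($keyFunction, $words); // keeps the original indexes
    asort($keys, SORT_STRING);               // plain byte-wise comparison of the keys
    $sorted = array();
    foreach (array_keys($keys) as $index) {
        $sorted[] = $words[$index];
    }
    return $sorted;
}

$words = array('Afrika', 'Ärger', 'Adler');
print_r(sortByKey($words, 'din5007Variant1Key')); // Adler, Afrika, Ärger
print_r(sortByKey($words, 'din5007Variant2Key')); // Adler, Ärger, Afrika
```

The same three words come out in a different order depending on the convention, which is why a single "correct" order does not exist without naming the language and the rule set.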

If you don't use Windows, you're in luck, as PHP provides some functions to do exactly this. Using a combination of setlocale(), usort(), strcoll() and the correct UTF-8 locale for your language, you get something like this:

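$array = array('àgile', 'ágile', 'âgile', 'ägile', 'ãgile', 'Agile', 'Test');
// Passing '0' only queries the current locale; setlocale() otherwise returns the NEW one.
$oldLocale = setlocale(LC_COLLATE, '0');
setlocale(LC_COLLATE, '<<your_RFC1766_language_code>>.utf8');
usort($array, 'strcoll');
// Restore the collation locale that was active before.
setlocale(LC_COLLATE, $oldLocale);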

Please note that it's mandatory to use the UTF-8 locale variant in order to sort UTF-8 strings. I reset the locale in the example above to its original value because the locale is maintained per process, not per thread, so setting it with setlocale() can introduce side effects in other running PHP scripts. See the PHP manual for more details.
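
Since the question wants the keys preserved, the same idea can be wrapped in a small helper that uses uasort() instead of usort() and checks whether the locale is actually installed. A minimal sketch: the function name sortWithLocale() and the locale name 'en_US.utf8' are assumptions for illustration, and the resulting order depends on the collation tables of your system.

```php
<?php
/**
 * Hypothetical helper: sorts values by the collation rules of $locale while
 * keeping the original keys. Returns false if the locale is not available.
 */
function sortWithLocale(array $values, $locale)
{
    $previous = setlocale(LC_COLLATE, '0');   // query only, nothing changes
    if (setlocale(LC_COLLATE, $locale) === false) {
        return false;                         // locale not installed on this system
    }
    uasort($values, 'strcoll');               // uasort() keeps the keys
    setlocale(LC_COLLATE, $previous);         // undo the process-wide change
    return $values;
}

$array = array(0 => 'Agile', 1 => 'ágile', 2 => 'àgile', 6 => 'Test');
$sorted = sortWithLocale($array, 'en_US.utf8'); // assumed locale name
if ($sorted === false) {
    echo "Locale not installed\n";
} else {
    print_r(array_keys($sorted));
}
```

Checking the return value matters because setlocale() fails silently when the requested locale is missing, and strcoll() then falls back to whatever collation was active before.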

If you do use a Windows machine, there is currently no solution to this problem, and I assume there won't be one before PHP 6. Please see my own question on SO targeting this specific problem.
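
One option the answer above does not cover: if the intl extension is installed (bundled since PHP 5.3, and not among the extensions the question lists, so treat this as an assumption), its Collator class does locale-aware comparison without touching setlocale(). A sketch with a made-up function name and 'en_US' as an example locale:

```php
<?php
/**
 * Hypothetical helper: locale-aware natural sort via intl's Collator.
 * Requires the intl extension; keys are preserved.
 */
function collatorNaturalSort(array $values, $locale)
{
    $collator = new Collator($locale);
    // Compare digit runs by numeric value, so "img2" sorts before "img10".
    $collator->setAttribute(Collator::NUMERIC_COLLATION, Collator::ON);
    // asort() sorts by value and keeps the keys; SORT_STRING compares everything as strings.
    $collator->asort($values, Collator::SORT_STRING);
    return $values;
}

if (extension_loaded('intl')) {
    $array = array(0 => 'Agile', 1 => 'ágile', 2 => 'àgile', 6 => 'Test');
    print_r(collatorNaturalSort($array, 'en_US'));
} else {
    echo "intl extension not available\n";
}
```

The NUMERIC_COLLATION attribute is what adds the natsort()-like handling of numbers on top of the language-specific ordering.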

