Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

select - MySQL matching unicode characters with ascii version

I'm running MySQL 5.1.50 and have a table that looks like this:

CREATE TABLE `organizations` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` text CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
  `url` text CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
  `phone` varchar(20) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
  `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `id` (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=25837 DEFAULT CHARSET=utf8;

The problem I'm having is that MySQL matches Unicode characters against their ASCII counterparts. For example, when I search for a word that contains an 'é', it also matches the same word spelled with an 'e', and vice versa:

mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT id, name FROM `organizations` WHERE `name` = 'Universite de Montreal';
+-------+-------------------------+
| id    | name                    |
+-------+-------------------------+
| 16973 | Université de Montreal  |
+-------+-------------------------+
1 row in set (0.01 sec)

I get these results both from PHP and the command line console. How can I get accurate matches from my SELECT queries?

Thanks!

1 Reply

You specified the name column as text CHARACTER SET utf8 COLLATE utf8_unicode_ci, which tells MySQL to treat e and é as equivalent in matching and sorting. That collation and utf8_general_ci both treat many characters as equivalent.

http://www.collation-charts.org/ is a great resource once you learn how to read the charts, which is pretty easy.

If you want e and é etc. to be considered different then you must choose a different collation. To find out what collations are on your server (assuming you're limited to UTF-8 encoding):

mysql> show collation like 'utf8%';

Then choose one, using the collation charts as a reference.
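
Before choosing, it can help to confirm which collation each column currently uses. A minimal sketch, assuming the `organizations` table from the question:

```sql
-- Show every column of the table; the Collation column of the output
-- tells you what each text column is currently compared with.
SHOW FULL COLUMNS FROM `organizations`;

-- Alternative listing of available collations, filtered by character set
-- rather than by a name pattern.
SHOW COLLATION WHERE Charset = 'utf8';
```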

One more special collation is utf8_bin, in which there are no equivalencies: it's a binary match.
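
If a binary match is what you want, here is a sketch of two ways to get it with the `organizations` table from the question. The first overrides the collation for a single query; the second changes the column itself, so try it on a copy first. Neither is the only option, and another collation from the charts may suit you better.

```sql
-- Option 1: override the collation for one comparison only.
-- The column is utf8, so applying a utf8 collation to it is allowed.
SELECT id, name
FROM `organizations`
WHERE `name` COLLATE utf8_bin = 'Universite de Montreal';

-- Option 2: change the column's collation permanently.
-- This affects every query that matches or sorts on `name`.
ALTER TABLE `organizations`
  MODIFY `name` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL;
```

With either option the unaccented search string should no longer match the row containing 'Université'. Keep in mind that a binary comparison is also case sensitive, so 'universite' and 'Universite' become different values too.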

The only MySQL Unicode collations I'm aware of that are not language specific are utf8_unicode_ci, utf8_general_ci and utf8_bin. These three are the odd ones out, because the real purpose of a collation is to make the computer match and sort as a person from a particular place would expect. Hungarian and Turkish dictionaries, for example, have their entries ordered according to different rules. Specifying a collation allows you to sort and match according to such local rules.

For example, it seems Danes consider e and é equivalent but Icelanders don't:

mysql> select _utf8'e' collate utf8_danish_ci
    -> = _utf8'é' collate utf8_danish_ci as equal;
+-------+
| equal |
+-------+
|     1 |
+-------+

mysql> select _utf8'e' collate utf8_icelandic_ci
    -> = _utf8'é' collate utf8_icelandic_ci as equal;
+-------+
| equal |
+-------+
|     0 |
+-------+
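
The same kind of check can be run against the three collations that are not language specific. A sketch, assuming the session was set up with SET NAMES utf8 as in the question; the column aliases are made up for illustration:

```sql
-- Compare 'e' and 'é' under each collation in one row.
-- COLLATE binds more tightly than =, so each comparison uses the named collation.
SELECT
  _utf8'e' COLLATE utf8_unicode_ci = _utf8'é' COLLATE utf8_unicode_ci AS unicode_ci,
  _utf8'e' COLLATE utf8_general_ci = _utf8'é' COLLATE utf8_general_ci AS general_ci,
  _utf8'e' COLLATE utf8_bin        = _utf8'é' COLLATE utf8_bin        AS bin;
```

Going by the behaviour described above, the first two columns should come back as 1 and the utf8_bin column as 0.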

Another handy trick is to fill a one-column table with the characters you're interested in (this is easier from a script); MySQL can then tell you which of them a collation treats as equivalent:

mysql> create table t (c char(1) character set utf8);
mysql> insert into t values ('a'), ('ä'), ('á');
mysql> select group_concat(c) from t group by c collate utf8_icelandic_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a               |
| á               |
| ä               |
+-----------------+

mysql> select group_concat(c) from t group by c collate utf8_danish_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a,á             |
| ä               |
+-----------------+

mysql> select group_concat(c) from t group by c collate utf8_general_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a,ä,á           |
+-----------------+
