Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


character encoding - MySQL distinction between e and é (e acute) - UNIQUE index

I have a table, students, with 3 columns: id, name, and age. I have a UNIQUE index Index_2 on columns name and age.

CREATE TABLE `bedrock`.`students` (
    `id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
    `name` VARCHAR(45) NOT NULL,
    `age` INTEGER UNSIGNED NOT NULL,
    PRIMARY KEY (`id`),
    UNIQUE INDEX `Index_2` USING BTREE (`name`, `age`)
) ENGINE = InnoDB;

I tried this insert option:

insert into students (id, name, age)
values (1, 'Ane', 23);

which works fine. Then I tried this one (note the é, e acute, in Ané):

insert into students (id, name, age)
values (2, 'Ané', 23);

and I receive this error message:

"Duplicate entry 'Ané-23' for key 'Index_2'"

MySQL somehow makes no distinction between "Ane" and "Ané". Why is this happening, and how can I resolve it?

Charset for table students is "utf8" and collation is "utf8_general_ci".

ALTER TABLE `students` CHARACTER SET utf8 COLLATE utf8_general_ci;

Later edit1: @Crozin:

I've changed to use collation utf8_bin:

ALTER TABLE `students`
CHARACTER SET utf8 COLLATE utf8_bin;

but I receive the same error.

But if I create the table from the start with charset utf8 and collation utf8_bin, like this:

CREATE TABLE `students2` (
    `id` INTEGER UNSIGNED AUTO_INCREMENT,
    `name` VARCHAR(45),
    `age` VARCHAR(45),
    PRIMARY KEY (`id`),
    UNIQUE INDEX `Index_2` USING BTREE (`name`, `age`)
) ENGINE = InnoDB CHARACTER SET utf8 COLLATE utf8_bin;

both of the insert commands below work:

insert into students2 (id, name, age)
values (1, 'Ane', 23); -- works ok

insert into students2 (id, name, age)
values (2, 'Ané', 23); -- works ok

This seems to be very weird.

Later edit 2:

I saw another answer here. I'm not sure whether the user deleted it or it got lost. I was just testing it:

The user wrote that he first created three tables with three different collations:

CREATE TABLE `utf8_bin` (
    `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
    `name` varchar(45) COLLATE utf8_bin NOT NULL,
    `age` int(10) unsigned NOT NULL,
    PRIMARY KEY (`id`),
    UNIQUE KEY `Index_2` (`name`,`age`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

CREATE TABLE `utf8_unicode_ci` (
    `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
    `name` varchar(45) COLLATE utf8_unicode_ci NOT NULL,
    `age` int(10) unsigned NOT NULL,
    PRIMARY KEY (`id`),
    UNIQUE KEY `Index_2` (`name`,`age`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

CREATE TABLE `utf8_general_ci` (
    `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
    `name` varchar(45) COLLATE utf8_general_ci NOT NULL,
    `age` int(10) unsigned NOT NULL,
    PRIMARY KEY (`id`),
    UNIQUE KEY `Index_2` (`name`,`age`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;

The user's results were:

INSERT INTO utf8_bin VALUES (1, 'Ane', 23), (2, 'Ané', 23);
Query OK, 2 rows affected (0.02 sec)
Records: 2  Duplicates: 0  Warnings: 0

INSERT INTO utf8_unicode_ci VALUES (1, 'Ane', 23), (2, 'Ané', 23);
Query OK, 2 rows affected (0.01 sec)
Records: 2  Duplicates: 0  Warnings: 0

INSERT INTO utf8_general_ci VALUES (1, 'Ane', 23), (2, 'Ané', 23);
Query OK, 2 rows affected (0.01 sec)
Records: 2  Duplicates: 0  Warnings: 0

Here are my results:

INSERT INTO utf8_bin VALUES (1, 'Ane', 23), (2, 'Ané', 23);
-- works ok

INSERT INTO utf8_unicode_ci VALUES (1, 'Ane', 23), (2, 'Ané', 23);
-- Duplicate entry 'Ané-23' for key 'Index_2'

INSERT INTO utf8_general_ci VALUES (1, 'Ane', 23), (2, 'Ané', 23);
-- Duplicate entry 'Ané-23' for key 'Index_2'

I'm not sure why these INSERT commands worked for him but not for me.

He also wrote that he tested this with MySQL on Linux. Could that have something to do with it? I don't think so.


1 Reply


and collation is "utf8_general_ci".

And that's the answer. If you're using the utf8_general_ci collation (actually this applies to all utf8_..._[ci|cs] collations), diacritics are ignored in comparisons, thus:

SELECT "e" = "é" AND "O" = "ó" AND "ä" = "a"

This returns 1. Indexes use the column's collation too, so the UNIQUE index treats 'Ane' and 'Ané' as the same value.
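To see the effect of each collation directly, the comparison can be repeated with an explicit COLLATE clause. This is only a sketch: it assumes the connection character set is utf8, as in the question, and the column aliases are made up for illustration. Under that assumption the four columns should come back as 1, 1, 0, 0.

```sql
-- Assumes the connection character set is utf8 (as in the question);
-- the eq_* aliases are illustrative names only.
SELECT
    'e' = 'é' COLLATE utf8_general_ci AS eq_general_ci,  -- accents ignored
    'e' = 'é' COLLATE utf8_unicode_ci AS eq_unicode_ci,  -- accents ignored
    'e' = 'é' COLLATE utf8_bin        AS eq_bin,         -- compared by code point
    'a' = 'A' COLLATE utf8_bin        AS eq_bin_case;    -- utf8_bin is also case-sensitive
```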

If you want to distinguish between ä and a, use the utf8_bin collation (keep in mind that it also distinguishes between uppercase and lowercase characters).
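One thing to watch when switching an existing table: ALTER TABLE ... CHARACTER SET ... COLLATE ... without CONVERT TO only changes the table's default for columns added later, and existing columns keep their own collation. That likely explains why the ALTER in the question's first edit had no effect while a freshly created utf8_bin table worked. A minimal sketch of two ways to change the collation the name column actually uses, based on the students table from the question:

```sql
-- Check which collation each column currently uses.
SHOW FULL COLUMNS FROM students;

-- Option 1: convert every character column in the table to utf8_bin.
ALTER TABLE students CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;

-- Option 2: change only the name column, repeating its existing definition.
ALTER TABLE students
    MODIFY `name` VARCHAR(45) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL;
```

Either option alone should be enough; the UNIQUE index on (name, age) then compares names under utf8_bin, so 'Ane' and 'Ané' no longer collide.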


By the way, name and age together don't guarantee uniqueness.

