Character set and collation for simplified Chinese — MySQL


I am curious about character sets and collations, especially how they are used in databases. I got some time to play with them recently. I did some testing today on MySQL. I will do the same test on Sql Server, Oracle, and PostgreSql, time permitting. I am only dealing with simplified Chinese at this point. I may take up traditional Chinese too.

Here is my setup:

1. I created a table that stores simplified Chinese characters in different character set, along with collation used, pinyin, number of strokes, and tone value. There are 126 collations in MySQL, only 10 of which are suited for simplified Chinese.

2. I used the Chinese version of the golden rule, 己所不欲, 勿施于人, from Confucius, as my sample characters. Of those 8 characters, 欲 and 于 have the same pinyin. To make the test more interesting, I added one more character, 旗, since it has the same radical as 施.

Here are the findings:

1. I don’t see any difference between the _bin and _chinese_ci collations. In all cases, they produced identical results;

2. Big5 sorts on number of strokes. gb2312 and gbk sort on pinyin. Yes, all of this is stated in the documentation, but I want to see it for myself;

3. ucs2 and utf8 collations generate the same sort results. Interestingly, it is only through those two that characters share the same radicals are sorted next to each other, at least based on my sample. I tend to believe that should be true in general. Another interesting thing is that with ucs2 and utf8, all 独体字 are sorted together, and all 合体字 are sorted together, at least based on my sample.

4. (This is an updated item, thanks to the comment from reader Jedy) However, if you have any English letter as a beginning of the string, regardless of the string is all English letters or English mixed with Chinese characters, the _bin and _CI collation do make a difference. For _bin collations, capital English letters always come first. For _CI collations, non-capitalized English letters come first.

Below is the script I used and the results, if you are interested.

SQL query: SELECT big5_bin , PinYin , StrokeNo , Tone FROM t1 ORDER BY big5_bin

big5_bin PinYin StrokeNo

Tone
ren 2 2
yu 3 2
ji 3 3
bu 4 2
wu 4 4
suo 8 3
shi 9 1
yu 11 4
qi 14 2

SQL query: SELECT big5_chinese_ci , PinYin , StrokeNo , Tone FROM t1 ORDER BY big5_chinese_ci

big5_chinese_ci PinYin StrokeNo

Tone
ren 2 2
ji 3 3
yu 3 2
bu 4 2
wu 4 4
suo 8 3
shi 9 1
yu 11 4
qi 14 2

SQL query: SELECT gb2312_bin , PinYin , StrokeNo , Tone FROM t1 ORDER BY gb2312_bin

gb2312_bin PinYin StrokeNo

Tone
bu 4 2
ji 3 3
qi 14 2
ren 2 2
shi 9 1
suo 8 3
wu 4 4
yu 3 2
yu 11 4

SQL query: SELECT gb2312_chinese_ci , PinYin , StrokeNo , Tone FROM t1 ORDER BY gb2312_chinese_ci

gb2312_chinese_ci PinYin StrokeNo

Tone
bu 4 2
ji 3 3
qi 14 2
ren 2 2
shi 9 1
suo 8 3
wu 4 4
yu 3 2
yu 11 4

SQL query: SELECT gbk_bin , PinYin , StrokeNo , Tone FROM t1 ORDER BY gbk_bin

gbk_bin PinYin StrokeNo

Tone
bu 4 2
ji 3 3
qi 14 2
ren 2 2
shi 9 1
suo 8 3
wu 4 4
yu 3 2
yu 11 4

SQL query: SELECT gbk_chinese_ci , PinYin , StrokeNo , Tone FROM t1 ORDER BY gbk_chinese_ci

gbk_chinese_ci PinYin StrokeNo

Tone
bu 4 2
ji 3 3
qi 14 2
ren 2 2
shi 9 1
suo 8 3
wu 4 4
yu 3 2
yu 11 4

SQL query: SELECT ucs2_bin , PinYin , StrokeNo , Tone FROM t1 ORDER BY ucs2_bin

ucs2_bin PinYin StrokeNo

Tone
bu 4 2
yu 3 2
ren 2 2
wu 4 4
ji 3 3
suo 8 3
shi 9 1
qi 14 2
yu 11 4

SQL query: SELECT ucs2_unicode_ci , PinYin , StrokeNo , Tone FROM t1 ORDER BY ucs2_unicode_ci

ucs2_unicode_ci PinYin StrokeNo

Tone
bu 4 2
yu 3 2
ren 2 2
wu 4 4
ji 3 3
suo 8 3
shi 9 1
qi 14 2
yu 11 4

SQL query: SELECT utf8_bin , PinYin , StrokeNo , Tone FROM t1 ORDER BY utf8_bin

utf8_bin PinYin StrokeNo

Tone
bu 4 2
yu 3 2
ren 2 2
wu 4 4
ji 3 3
suo 8 3
shi 9 1
qi 14 2
yu 11 4

SQL query: SELECT utf8_unicode_ci , PinYin , StrokeNo , Tone FROM t1 ORDER BY utf8_unicode_ci

utf8_unicode_ci PinYin StrokeNo

Tone
bu 4 2
yu 3 2
ren 2 2
wu 4 4
ji 3 3
suo 8 3
shi 9 1
qi 14 2
yu 11 4
,

3 responses to “Character set and collation for simplified Chinese — MySQL”

  1. 当字段中有英文字符时,gb2312_bin和gb2312_chinese_ci的排序结果不同。

  2. 谢谢你,Jedy!你的评论给了我启发,我会再做一些实验看一看。

  3. Jedy,

    我重新做了实验。谢谢你的启发。我已经把新的结果补充到上面第四条。

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.