Convert character set to utf8 in MySQL

Update: If you are interested in automating this task with a simple Perl script, I created one here.

Say, you had a MySQL database that stores texts in languages such as Chinese, Japanese, Korean, Hebrew, Arabic, and so on, and recently there was a server move or upgrade, and all of sudden text in the above mentioned languages didn’t display properly. What do we do to fix it then?

That was what happened to me recently. My hosting company migrated this site to a new LAMP server, and managed to screw up the character set so my Chinese entries looked like gibberish. I spent some time researching, and was able to convert my WordPress database character set to utf8 successfully. Below is the notes for my own consumption. If it helps you, dear reader, all the better!

Please follow the procedures below step by step in the order presented here. If you need help running sql script using command line tools, I have a little discussion here. I am assuming you have access to command line tools or things like phpMyAdmin. Please remember to replace MyDb with your database name.

1. Switch to information_schema database by:

USE information_schema;

2. Run this statement below. It will generate sql statements to be executed later. Save the result in a sql script file.

SELECT CONCAT('ALTER TABLE ', table_name, ' MODIFY ', column_name, ' ', REPLACE(column_type, 'char', 'binary'), ';') FROM columns WHERE table_schema = 'MyDb' and data_type LIKE '%char%';

3. Run this statement below. It will generate sql statements to be executed later. Save the result in the same sql script file, below the sql statements generated earlier;

SELECT CONCAT('ALTER TABLE ', table_name, ' MODIFY ', column_name, ' ', REPLACE(column_type, 'text', 'blob'), ';') FROM columns WHERE table_schema = 'MyDb' and data_type LIKE '%text%';

4. Run this statement below. It will generate sql statements to be executed later. Save the result in the same sql script file, below the sql statements generated earlier;

SELECT CONCAT('ALTER TABLE ', table_name, ' MODIFY ', column_name, ' ', column_type, ' CHARACTER SET utf8;') FROM columns WHERE table_schema = 'MyDb' and data_type LIKE '%char%';

5. Run this statement below. It will generate sql statements to be executed later. Save the result in the same sql script file, below the sql statements generated earlier;

SELECT CONCAT('ALTER TABLE ', table_name, ' MODIFY ', column_name, ' ', column_type, ' CHARACTER SET utf8;') FROM columns WHERE table_schema = 'MyDb' and data_type LIKE '%text%';

6. Add the line below to the sql script file;

ALTER DATABASE MyDb CHARACTER SET utf8;

7. Switch to MyDb by:

USE MyDb;

8. Run the script file generated from the above steps, and you should be good.

Enjoy! My procedure above is especially useful for upgrading databases from MySQL 4 and earlier to MySQL 5 and later, as discussed here. It appears that for MySQL 5 and later, it is possible to convert character set of all columns in one command, like:

ALTER TABLE wp_posts CONVERT TO CHARACTER SET utf8;

However, I didn’t test it so I am not sure.

By the way, if you are interested in collation in MySQL, especially for Chinese, I wrote a little note here.

32 Comments »

  1. HeavySword Said,

    November 18, 2008 @

    你这blog程序很强啊
    这代码高亮整得
    啧啧啊

  2. Haidong Ji Said,

    November 18, 2008 @

    我用了个SyntaxHighlighter Plus WordPress 插件。这个插件是建在Google code的syntaxhighlighter。这儿是网站,说不准你也可以用上。

    http://code.google.com/p/syntaxhighlighter/

  3. Andy Said,

    January 14, 2009 @

    This has been an on-going problem since the early days of MySQL, none of the advice or recommendations worked properly, even those suggested by MySQL!

    Your solution *WORKS*, it’s intuitive and easy to implement.

    This is a nicely written article. Thank you for sharing this information.

    Xie xie

    Andy

  4. Haidong Ji Said,

    January 14, 2009 @

    不用谢, you are welcome Andy. Glad it helped.

    I plan to write a Perl script to do this all automatically. That should make it even easier. When I do, I will link it here.

  5. alank Said,

    February 4, 2009 @

    Hello,

    I’ve started the process, but observe that the order is perhaps not correct. The var type is changed before the utf8 is introduced. The schema generates an ALTER statement which searches for the original column type in all cases. If the type is changed first, they get changed back on the next step. Please advise. Thanks for your contribution.

  6. alank Said,

    February 6, 2009 @

    I found an explanation here regarding the blob/varbinary technique implemented here. Now I understand the double-processing of candidate fields in the db tables.

  7. Haidong Ji Said,

    February 6, 2009 @

    Sorry for the late reply alank. Glad you found out the reason.

    I do want to make this an automated process, through Perl. Just haven’t got the time to do it yet.

  8. yejr Said,

    March 17, 2009 @

    我也尝试过直接转换数据表(table)还有列(columns)的字符集,从gbk/gb2312转到utf8,或者反向,但结果不行,当时是在5.0版本下做的。找时间再按照你的方法做一遍试试看。

  9. Haidong Ji Said,

    March 17, 2009 @

    用我写的这个方法应当可以。我也写了个简单的Perl脚本。
    http://www.haidongji.com/2009/02/16/perl-script-to-convert-mysql-character-set-to-utf8/

    但如果有很多数据的话,速度会慢。如果用Fulltext的索引,还会有其他麻烦。

    我和Ryan Lowe在电邮里讨论过这个问题,他的这篇博客就更深入,很有参考意义。

    http://www.mysqlperformanceblog.com/2009/03/17/converting-character-sets/

    按:刚到你的网站逛逛,感觉有很多实用的信息,以加入我的Google Reader。叶大侠如果是苏北徐州附近的人,我们的方言都差不多。

  10. Converting Character Sets | WORDPRESS EXTENSIONS-PLUGINS-THEMES-TEMPLATES Said,

    March 18, 2009 @

    [...] Approach #2 (outlined here): [...]

  11. yejr Said,

    March 20, 2009 @

    呵呵,我老婆是苏北的,我是福建人 :)

  12. Haidong Ji Said,

    March 20, 2009 @

    我大学是在福建读的,厦门大学。今年夏天会故地重游。

  13. MySQL : преобразование кодировок | SysAdmin Online Said,

    March 30, 2009 @

    [...] Описан здесь. [...]

  14. alank Said,

    April 7, 2009 @

    Hello,

    I tried using this with a MediaWiki db, and ran into a spot of bother with in the searchindex table.

    For example, this line,

    ALTER TABLE searchindex MODIFY si_title varbinary(255);

    threw this error:

    ERROR 1283 (HY000) at line 108: Column ‘si_title’ cannot be part of FULLTEXT index

    I thought it might have been remedied by changing the collation of each table from latin1 to utf8. However, after that step, I still got the same error. I’m not certain what causes this stumble, but thought I’d share it here, at any rate. Any thoughts?

  15. Haidong Ji Said,

    April 8, 2009 @

    Hi Alan,

    Check out Ryan Lowe’s discussion of this topic here. He has a script written, and the comments in that entry is also interesting.

    Hope that helps. I would be interested in hearing how you solve your issue eventually.

    Good luck!

    http://www.mysqlperformanceblog.com/2009/03/17/converting-character-sets/

  16. naitan Said,

    April 24, 2009 @

    could you say a few words about DATA WAREHOUSING? Any advice about how to acquire relevant skills? Thank you!

  17. Haidong Ji Said,

    April 25, 2009 @

    Naitan,

    Data Warehousing is a big topic. I suggest looking for books on Amazon and read reviews, then find the book that has the best review and read it.

    Cheers.

  18. Thijs Said,

    May 7, 2009 @

    Gee dude,
    Thanks for this post, this really helped me getting closer to a solution to my problem!

  19. Haidong Ji Said,

    May 7, 2009 @

    Anytime, man.

  20. Konversi MySQL ke UTF-8 Selesai | Aman Personal Website Said,

    July 22, 2009 @

    [...] harus dilalui dalam proses konversi ke UTF-8 bagi database yang sudah berisi banyak data. Untungnya di laman inidipaparkan cara yang lebih [...]

  21. louie Said,

    September 24, 2009 @

    Thanks for this solution to converting a mysql database to utf8 it rocks.

    had one slight problem i had a table called interval , so altered the sql slightly to quote around it like so `interval`

    then worked perfectly :)

  22. Haidong Ji Said,

    September 24, 2009 @

    You are welcome, Louie. Cheers!

  23. Tomas Said,

    October 17, 2009 @

    hi haidong,

    I have a question, your scripts seems can not solve the problem of foreign key with ‘char’ column inside, any suggestion ?

  24. Daniel Said,

    October 18, 2009 @

    Hi Haidong Ji,

    I adapted your script for converting a latin1 WP MU site into UTF8 and it worked perfectly. Many thanks for sharing your knowledge –and beside, it’s a beautiful solution;-)

  25. Haidong Ji Said,

    October 18, 2009 @

    Tomas,

    Not too sure what you mean by that. Short of having access to the server and files, it is hard to see where exactly the problem is.

    Daniel,

    Sharing with and benefiting from each other is my goal. I am really glad it helped. Cheers!

  26. ипотпал Said,

    November 24, 2009 @

    Very good articles ) :P

  27. Sanitizing Wordpress UTF-8 – or Howto get rid of mixed Latin1 and UTF8 mysql exports / Joes Blog Said,

    December 18, 2009 @

    [...] Actually my atantion comes to some weird characters in my wordpress blog. Such as ü or ö as represantants of ä and ö. So i had a look into my mysql-db and saw that it was still on latin1. On my way to the clearance i got over that explanation. But still all howtos doesn’t work out for me. There are others allready had a look into it like fischerlander or Ji’s Blog. [...]

  28. Joe Said,

    December 18, 2009 @

    Hi, thx for your work – it was a starter but the problem is more complex as the browsersettings can ignore the websites needs. So if you set your browser not to autorecognize the websites coding – rather than to a fixed setting this will lead also after a clean-up to a messed database. So i think it should be fixed in the post-message in WordPress.

    yours Joe

  29. Haidong Ji Said,

    December 18, 2009 @

    Cheers, Joe.

    Yeah, if a browser forces a particular encoding, it will display gibberish.

  30. WordPress 2.9 Illegal Mix of Collations in Database | semioticpixels scratch pad Said,

    March 4, 2010 @

    [...] following 4 queries are lifted directly from Haidong Ji via the WordPress Codex. I added the pipe out to text file. Note that the path to mysql executable [...]

  31. Joe Said,

    March 31, 2010 @

    Hey,
    adding this to the comment-template helped:
    form accept-charset=”UTF-8″

    but it is still not final. Every modern Browser except IE<8 accept this – so it is an IE/old-Browser-Issue. I am looking for some Java-Script to get the Browsers settings but was unlucky yet.

  32. Nhut Nguyen Said,

    September 17, 2011 @

    OK, i has just successfull

RSS feed for comments on this post · TrackBack URI

Leave a Comment