[wp-trac] [WordPress Trac] #3517: 100% UTF-8

WordPress Trac wp-trac at lists.automattic.com
Tue Jan 2 02:07:15 GMT 2007


#3517: 100% UTF-8
---------------------+------------------------------------------------------
 Reporter:  sehh     |       Owner:  anonymous
     Type:  defect   |      Status:  new      
 Priority:  normal   |   Milestone:  2.0.7    
Component:  General  |     Version:  2.0.5    
 Severity:  major    |    Keywords:  UTF-8    
---------------------+------------------------------------------------------
 WP runs in a mix of semi-Unicode and ASCII/Latin modes. As a result, people
 with languages that require UTF-8 character sets are having major
 problems. The issue isn't easily detectable, since storing and retrieving
 UTF-8 data in an SQL database with a Latin character set seems to work.
 Unfortunately, it doesn't really work. WP can store UTF-8 data in a
 database/table/field with a Latin character set, but all SQL-based text
 functions return wrong values.

 For example, SORTING, COMPARING, or MANIPULATING any string returns
 invalid data (not sorted properly, etc.). It's about time WP started using
 UTF-8 everywhere.
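
 To make the "wrong values" concrete, here is a minimal sketch in Python rather than SQL. It assumes Python's "latin-1" codec is a close enough stand-in for MySQL's latin1 character set, and the helper name is made up for illustration. It shows what a Latin column "sees" when it is handed UTF-8 bytes.

```python
def seen_by_latin1_column(text: str) -> str:
    """Hypothetical helper: UTF-8 bytes read back under a latin1 label."""
    return text.encode("utf-8").decode("latin-1")


word = "αβγ"  # three Greek letters, two UTF-8 bytes each
stored = seen_by_latin1_column(word)

# Character count: the column counts one character per byte.
real_length = len(word)      # 3
seen_length = len(stored)    # 6

# Upper-casing works on the mis-labelled characters, so the Greek
# letters come back unchanged instead of as "ΑΒΓ".
upper_roundtrip = stored.upper().encode("latin-1").decode("utf-8")

# Taking the "first character" returns only half of the first letter.
first_char_bytes = stored[:1].encode("latin-1")  # b"\xce", not valid UTF-8 alone
```

 The same mismatch is behind bad sorting and comparison: the server orders and compares the individual bytes as if each were a Latin character.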

 The change to UTF-8 isn't simple. Some people think that they can just
 "ALTER TABLE" to the UTF-8 charset, then use "SET NAMES utf8", and they'll
 be fine. WRONG!

 For a new installation, it's rather easy:

 1) All database and table definitions must be set to UTF-8. Some examples:
 create database wordpress DEFAULT CHARACTER SET utf8 DEFAULT COLLATE
 utf8_general_ci;
 create table wp_users (etc...) DEFAULT CHARACTER SET utf8 COLLATE
 utf8_general_ci;

 2) Modify the WP database connection to execute the following:
 SET NAMES utf8;
 SET COLLATION_CONNECTION=utf8_general_ci;

 That's about it. A new installation can easily run with full UTF-8 support
 without any more changes.


 Now, how about upgrading an existing database? That's more complex.
 Read this carefully:

 When an ALTER TABLE changes the character set, all TEXT (and
 similar) fields are converted to UTF-8. The conversion BREAKS existing
 text because it expects the data to be in Latin, but it is not, since
 WP has stored Unicode characters in a Latin database. As a
 result, we get garbage after the conversion!
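
 The breakage can be reproduced outside the database. This is a sketch only: it assumes Python's "latin-1" codec approximates MySQL's latin1, and the function name is invented for the example. The conversion trusts the column's Latin label and re-encodes every byte that is already UTF-8.

```python
def convert_latin1_to_utf8(raw: bytes) -> bytes:
    """Hypothetical stand-in for a charset conversion that trusts the latin1 label."""
    return raw.decode("latin-1").encode("utf-8")


original = "καλημέρα"                 # 8 Greek letters
on_disk = original.encode("utf-8")    # 16 bytes WP wrote into the Latin column

converted = convert_latin1_to_utf8(on_disk)   # 32 bytes: every byte encoded again
garbled = converted.decode("utf-8")           # 16 Latin-range characters, no Greek

# The damage is reversible only by undoing the extra encoding step.
repaired = garbled.encode("latin-1").decode("utf-8")
```

 Each original byte is above 0x7F, so the conversion turns it into two bytes and the text doubles in size without containing a single Greek letter.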

 The solution is to ALTER all TEXT and related fields to BLOB, then alter
 the character set, and finally change the BLOB fields back to TEXT.

 Example steps:

 1) ALTER TABLE users MODIFY Last_Name BLOB;
 2) ALTER DATABASE wordpress charset=utf8;
 3) ALTER TABLE users charset=utf8;
 4) ALTER TABLE users MODIFY Last_Name TEXT CHARACTER SET utf8;
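
 For more than one column, the four steps above could be generated rather than typed by hand. The following is a rough sketch, not a finished conversion script: the function name and the type table are assumptions for illustration, and table and column names are inserted as-is, so it is only suitable for trusted names.

```python
# Binary counterpart of each MySQL text type.
BINARY_TWIN = {
    "TINYTEXT": "TINYBLOB",
    "TEXT": "BLOB",
    "MEDIUMTEXT": "MEDIUMBLOB",
    "LONGTEXT": "LONGBLOB",
}


def utf8_migration_statements(database, table, text_columns):
    """Return the ALTER statements in the order given in the ticket.

    text_columns maps a column name to its current text type, e.g. "TEXT".
    """
    # Step 1: turn every text column into its binary twin.
    to_blob = [
        f"ALTER TABLE {table} MODIFY {col} {BINARY_TWIN[ctype.upper()]};"
        for col, ctype in text_columns.items()
    ]
    # Steps 2 and 3: switch the database and table defaults to utf8.
    switch = [
        f"ALTER DATABASE {database} charset=utf8;",
        f"ALTER TABLE {table} charset=utf8;",
    ]
    # Step 4: restore the text type, now labelled utf8.
    to_text = [
        f"ALTER TABLE {table} MODIFY {col} {ctype.upper()} CHARACTER SET utf8;"
        for col, ctype in text_columns.items()
    ]
    return to_blob + switch + to_text


statements = utf8_migration_statements("wordpress", "users", {"Last_Name": "TEXT"})
```

 Note that MODIFY replaces the whole column definition, so attributes such as NOT NULL or a default value would have to be restated in a real script.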

 So, we change our text fields to BLOB, switch our database and tables to
 UTF-8, and finally, in one go, return our initial TEXT fields and switch
 them to UTF-8.

 The key here is that a BLOB field will not be converted to garbage when
 switched to UTF-8, unlike a TEXT field.
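
 A small sketch of why the BLOB detour is safe, again in Python with invented helper names: a BLOB carries no character set, so the bytes pass through untouched and are simply relabelled as UTF-8 at the end. The validity check is an extra assumption, not part of the steps above; it flags values that were genuinely stored as Latin and would not be valid UTF-8 after relabelling.

```python
def relabel_via_blob(raw: bytes) -> str:
    """Hypothetical model of TEXT -> BLOB -> TEXT CHARACTER SET utf8."""
    as_blob = bytes(raw)            # BLOB: same bytes, no charset label
    return as_blob.decode("utf-8")  # relabelled as utf8, no re-encoding


def is_valid_utf8(raw: bytes) -> bool:
    """Check whether a stored value can be relabelled as UTF-8 at all."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


original = "καλημέρα"
on_disk = original.encode("utf-8")   # what WP wrote into the Latin column

restored = relabel_via_blob(on_disk)        # identical to the original text
latin_only_value = "é".encode("latin-1")    # a value that really is Latin
safe_to_relabel = is_valid_utf8(latin_only_value)  # False
```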

 Hopefully, the developers of WP will be able to create a conversion script
 to upgrade old Latin databases.

-- 
Ticket URL: <http://trac.wordpress.org/ticket/3517>
WordPress Trac <http://trac.wordpress.org/>
WordPress blogging software

