[wp-trac] Re: [WordPress Trac] #3517: WordPress should be 100% UTF-8

WordPress Trac wp-trac at lists.automattic.com
Tue Jan 30 22:03:12 GMT 2007


#3517: WordPress should be 100% UTF-8
---------------------+------------------------------------------------------
 Reporter:  sehh     |        Owner:  anonymous
     Type:  defect   |       Status:  new      
 Priority:  normal   |    Milestone:  2.2      
Component:  General  |      Version:  2.0.5    
 Severity:  major    |   Resolution:           
 Keywords:  UTF-8    |  
---------------------+------------------------------------------------------
Old description:

> WP runs in a mix of Unicode and ASCII/Latin modes. As a result, people
> whose languages require UTF-8 character sets are having major problems.
> The issue isn't easily detectable, since storing UTF-8 data in, and
> retrieving it from, an SQL database with a Latin character set seems to
> work. Unfortunately, it doesn't really work. WP can store UTF-8 data in a
> database/table/field with a Latin character set, but SQL-based text
> functions then return wrong values for non-ASCII text.
>
> For example, SORTING, COMPARING, or MANIPULATING any such string returns
> invalid data (not sorted properly, etc.). It's about time WP started
> using UTF-8 everywhere.
>
> The change to UTF-8 isn't simple. Some people think that they can just
> "ALTER TABLE" to the UTF-8 charset, then use "SET NAMES utf8", and
> they'll be fine. WRONG!
>
> For a new installation, it's rather easy:
>
> 1) All database and table definitions must be set to UTF-8. Some
> examples:[[br]]
> CREATE DATABASE wordpress DEFAULT CHARACTER SET utf8 DEFAULT COLLATE
> utf8_general_ci;[[br]]
> CREATE TABLE wp_users (etc...) DEFAULT CHARACTER SET utf8 COLLATE
> utf8_general_ci;
>
> 2) Modify the WP database connection to execute the following:[[br]]
> SET NAMES utf8;[[br]]
> SET COLLATION_CONNECTION=utf8_general_ci;
>
> That's about it. A new installation can easily run with full UTF-8
> support without any more changes.
>

> Now, how about upgrading from an existing database? That's more complex.
> Read this carefully:
>
> When doing an ALTER TABLE to change the character set, all TEXT (and
> similar) fields are converted to UTF-8. The conversion BREAKS existing
> text because it expects the data to be in Latin, but it is not, since WP
> has stored Unicode characters in a Latin database. As a result, we get
> garbage after the conversion!
>
> The solution is to ALTER all TEXT and related fields to BLOB, then alter
> the character set, and finally change the BLOB fields back to TEXT.
>
> Example steps:
>
> 1) ALTER TABLE users MODIFY Last_Name BLOB; [[br]]
> 2) ALTER DATABASE wordpress charset=utf8; [[br]]
> 3) ALTER TABLE users charset=utf8; [[br]]
> 4) ALTER TABLE users MODIFY Last_Name TEXT CHARACTER SET utf8;
>
> So, we change our text fields to BLOB, switch our database and tables to
> UTF-8, and finally, in one go, return the BLOB fields to TEXT and switch
> them to UTF-8.
>
> The key here is that a BLOB field will not be converted to garbage when
> switched to UTF-8, unlike a TEXT field, because binary data has no
> character set and so is never transcoded.
>
> Hopefully, the developers of WP will be able to create a conversion
> script to upgrade old Latin databases.

New description:

 WP runs in a mix of Unicode and ASCII/Latin modes. As a result, people
 whose languages require UTF-8 character sets are having major problems.
 The issue isn't easily detectable, since storing UTF-8 data in, and
 retrieving it from, an SQL database with a Latin character set seems to
 work. Unfortunately, it doesn't really work. WP can store UTF-8 data in a
 database/table/field with a Latin character set, but SQL-based text
 functions then return wrong values for non-ASCII text.

 For example, SORTING, COMPARING, or MANIPULATING any such string returns
 invalid data (not sorted properly, etc.). It's about time WP started
 using UTF-8 everywhere.
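
 As a rough illustration of why text functions misbehave, the sketch below
 mimics a latin1-labelled column that has been handed UTF-8 bytes. It is
 Python, not SQL; the helper name as_latin1 is made up for this example,
 and Python's "latin-1" codec is only a stand-in for MySQL's latin1
 character set (an assumption, the two are not byte-for-byte identical).

```python
def as_latin1(text: str) -> str:
    """Return what a latin1-labelled column 'sees' when handed UTF-8 bytes."""
    return text.encode("utf-8").decode("latin-1")


word = "é"
seen = as_latin1(word)

# Length: the single character occupies two bytes, so it is counted twice.
char_count = len(seen)

# Manipulation: uppercasing acts on the two misread characters, so the
# bytes never change and the result is still a lowercase "é".
uppercased = seen.upper().encode("latin-1").decode("utf-8")

# Comparison: "É" and "é" match case-insensitively as real Unicode text,
# but not once both have been misread as latin1.
equal_as_unicode = "É".lower() == "é".lower()
equal_as_latin1 = as_latin1("É").lower() == as_latin1("é").lower()

print(char_count, uppercased, equal_as_unicode, equal_as_latin1)
```

 The same byte-versus-character mismatch is what would throw off sorting,
 comparison, and string functions on the database side.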

 The change to UTF-8 isn't simple. Some people think that they can just
 "ALTER TABLE" to the UTF-8 charset, then use "SET NAMES utf8", and
 they'll be fine. WRONG!

 For a new installation, it's rather easy:

 1) All database and table definitions must be set to UTF-8. Some
 examples:[[br]]
 CREATE DATABASE wordpress DEFAULT CHARACTER SET utf8 DEFAULT COLLATE
 utf8_general_ci;[[br]]
 CREATE TABLE wp_users (etc...) DEFAULT CHARACTER SET utf8 COLLATE
 utf8_general_ci;

 2) Modify the WP database connection to execute the following:[[br]]
 SET NAMES utf8;[[br]]
 SET COLLATION_CONNECTION=utf8_general_ci;

 That's about it. A new installation can easily run with full UTF-8
 support without any more changes.


 Now, how about upgrading from an existing database? That's more complex.
 Read this carefully:

 When doing an ALTER TABLE to change the character set, all TEXT (and
 similar) fields are converted to UTF-8. The conversion BREAKS existing
 text because it expects the data to be in Latin, but it is not, since WP
 has stored Unicode characters in a Latin database. As a result, we get
 garbage after the conversion!
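
 The garbage described above can be reproduced outside the database. The
 following Python sketch assumes the server simply decodes the stored
 bytes as latin1 and re-encodes them as UTF-8; the function name is
 hypothetical and Python's "latin-1" codec stands in for MySQL's latin1.

```python
def convert_latin1_column_to_utf8(stored: bytes) -> bytes:
    # The server believes the bytes are latin1 text, so it decodes them as
    # latin1 and re-encodes every resulting character as UTF-8.
    return stored.decode("latin-1").encode("utf-8")


original = "naïve café"
stored = original.encode("utf-8")   # what WP wrote into the latin1 column

converted = convert_latin1_column_to_utf8(stored)
garbled = converted.decode("utf-8")

print(garbled)  # naÃ¯ve cafÃ©
```

 Each non-ASCII character ends up encoded twice, which is why the text
 looks intact before the conversion and broken after it.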

 The solution is to ALTER all TEXT and related fields to BLOB, then alter
 the character set, and finally change the BLOB fields back to TEXT.

 Example steps:

 1) ALTER TABLE users MODIFY Last_Name BLOB; [[br]]
 2) ALTER DATABASE wordpress charset=utf8; [[br]]
 3) ALTER TABLE users charset=utf8; [[br]]
 4) ALTER TABLE users MODIFY Last_Name TEXT CHARACTER SET utf8;

 So, we change our text fields to BLOB, switch our database and tables to
 UTF-8, and finally, in one go, return the BLOB fields to TEXT and switch
 them to UTF-8.
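
 A conversion script could generate the statements from the example steps
 for each table. The sketch below is only one possible shape for that,
 not an existing WP tool: the function name, the type mapping, and the
 argument layout are all assumptions, and it only covers the TEXT family.
 It does not quote identifiers, and a real script would also have to
 restate column attributes such as NOT NULL in each MODIFY.

```python
# Text types and the binary types that hold the same bytes without a charset.
BINARY_EQUIVALENT = {
    "TINYTEXT": "TINYBLOB",
    "TEXT": "BLOB",
    "MEDIUMTEXT": "MEDIUMBLOB",
    "LONGTEXT": "LONGBLOB",
}


def conversion_statements(database, table, columns):
    """Build the ALTER statements for one table, in the order of the steps.

    columns is a list of (column_name, text_type) pairs,
    for example [("Last_Name", "TEXT")].
    """
    statements = []
    # Step 1: move every text column to its binary equivalent.
    for name, text_type in columns:
        blob_type = BINARY_EQUIVALENT[text_type.upper()]
        statements.append(f"ALTER TABLE {table} MODIFY {name} {blob_type};")
    # Steps 2 and 3: switch the database and table defaults to utf8.
    statements.append(f"ALTER DATABASE {database} charset=utf8;")
    statements.append(f"ALTER TABLE {table} charset=utf8;")
    # Step 4: bring the columns back as text, now labelled utf8.
    for name, text_type in columns:
        statements.append(
            f"ALTER TABLE {table} MODIFY {name} {text_type.upper()} "
            "CHARACTER SET utf8;"
        )
    return statements


steps = conversion_statements("wordpress", "users", [("Last_Name", "TEXT")])
for statement in steps:
    print(statement)
```

 For the single column used in the example steps, this prints the same
 four statements in the same order.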

 The key here is that a BLOB field will not be converted to garbage when
 switched to UTF-8, unlike a TEXT field, because binary data has no
 character set and so is never transcoded.
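
 The difference between the two paths can be modelled in a few lines of
 Python. This is a simplified model, not MySQL's implementation: the
 Column class and modify function are invented for the example, with a
 charset of None standing for a binary (BLOB) column.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Column:
    data: bytes
    charset: Optional[str]  # None models a BLOB column


def modify(column: Column, new_charset: Optional[str]) -> Column:
    """Model of changing a column's type or character set."""
    if column.charset is None or new_charset is None:
        # A binary type is involved: the bytes are only relabelled.
        return Column(column.data, new_charset)
    # Text to text: the value is transcoded from the old charset to the new.
    transcoded = column.data.decode(column.charset).encode(new_charset)
    return Column(transcoded, new_charset)


original = "café"
stored = Column(original.encode("utf-8"), "latin-1")  # UTF-8 in a latin1 column

direct = modify(stored, "utf-8")                  # TEXT -> TEXT
via_blob = modify(modify(stored, None), "utf-8")  # TEXT -> BLOB -> TEXT

print(direct.data.decode("utf-8"))
print(via_blob.data.decode("utf-8"))
```

 Only the path through the binary step leaves the stored bytes untouched,
 so only that path reads back as the original text.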

 Hopefully, the developers of WP will be able to create a conversion
 script to upgrade old Latin databases.

 Some of the related tickets: #2828, #2942, #3184

-- 
Ticket URL: <http://trac.wordpress.org/ticket/3517#comment:9>
WordPress Trac <http://trac.wordpress.org/>
WordPress blogging software

