[wp-trac] [WordPress Trac] #3517: 100% UTF-8
WordPress Trac
wp-trac at lists.automattic.com
Tue Jan 2 02:07:15 GMT 2007
#3517: 100% UTF-8
---------------------+------------------------------------------------------
 Reporter:  sehh     |       Owner:  anonymous
     Type:  defect   |      Status:  new
 Priority:  normal   |   Milestone:  2.0.7
Component:  General  |     Version:  2.0.5
 Severity:  major    |    Keywords:  UTF-8
---------------------+------------------------------------------------------
WP is running in a mixed semi-Unicode and ASCII/Latin mode. As a result,
people whose languages require UTF-8 character sets are having major
problems. The issue isn't easily detectable, since storing and retrieving
UTF-8 data in an SQL database with a Latin character set seems to work.
Unfortunately, it doesn't really work. WP can store UTF-8 data in a
database/table/field with a Latin character set, but all SQL-based text
functions return wrong values.
For example: SORTING, COMPARING, or MANIPULATING any string returns
invalid data (not sorted properly, etc). It's about time WP started using
UTF-8 everywhere.
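A minimal Python sketch of why the text functions go wrong, assuming the server counts every stored byte as one Latin-1 character. The helper name is hypothetical, and Python's latin-1 codec only approximates MySQL's latin1 character set:

```python
def as_latin1_column_sees(text: str) -> str:
    """Return the characters a Latin-1 column 'sees' for UTF-8 encoded text."""
    return text.encode("utf-8").decode("latin-1")

word = "αβγ"                         # three Greek letters
seen = as_latin1_column_sees(word)   # the same bytes, read one byte per character

print(len(word))              # 3 characters as the user typed them
print(len(seen))              # 6 "characters" when counted byte by byte
print(seen.upper() == seen)   # True: no Greek letter is recognised, so nothing is upper-cased
```

Length, case conversion, and ordering all operate on the six pseudo-characters, not on the three letters that were actually stored.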
The change to UTF-8 isn't simple. Some people think that they can just
"ALTER TABLE" to the UTF-8 charset, then use "SET NAMES utf8", and they'll
be fine. WRONG!
For a new installation, it's rather easy:
1) All database and table definitions must be set to UTF-8. Some examples:
CREATE DATABASE wordpress DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
CREATE TABLE wp_users (etc...) DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
2) Modify the WP database connection to execute the following:
SET NAMES utf8;
SET COLLATION_CONNECTION=utf8_general_ci;
That's about it. A new installation can easily run with full UTF-8 support
without any more changes.
Now, how about upgrading from an existing database? That's more complex.
Read this carefully:
When doing an ALTER TABLE to change the character set, all TEXT (and
similar) fields are converted to UTF-8. The conversion BREAKS existing
text because it expects the data to be in Latin, but it is not, since WP
has stored Unicode characters in a Latin database. As a result we get
garbage after the conversion!
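The breakage can be reproduced outside the database. This is a sketch only: `naive_convert` is a hypothetical stand-in for the server's Latin-to-UTF-8 conversion, and Python's latin-1 codec is assumed to be close enough to MySQL's latin1 for this character:

```python
def naive_convert(stored: bytes) -> bytes:
    """Mimic a latin1 -> utf8 column conversion: every byte is read as one
    Latin-1 character and then re-encoded as UTF-8."""
    return stored.decode("latin-1").encode("utf-8")

original = "é"
stored = original.encode("utf-8")    # what WP actually wrote: b'\xc3\xa9'
converted = naive_convert(stored)    # b'\xc3\x83\xc2\xa9', encoded twice

print(converted.decode("utf-8"))     # Ã©
```

The two bytes of the original character are each treated as a separate Latin-1 character, so one letter becomes two unrelated ones.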
The solution is to ALTER all TEXT and related fields to BLOB, then alter
the character set, and finally change the BLOB fields back to TEXT.
Example steps:
1) ALTER TABLE users MODIFY Last_Name BLOB;
2) ALTER DATABASE wordpress charset=utf8;
3) ALTER TABLE users charset=utf8;
4) ALTER TABLE users MODIFY Last_Name TEXT CHARACTER SET utf8;
So, we change our text fields to BLOB, switch our database and tables to
UTF-8, and finally, in one go, we restore our initial TEXT fields and switch
them to UTF-8.
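For a table with several text columns, the same four steps could be generated by a small helper. This is only a sketch: the function name and its arguments are hypothetical, it does not quote identifiers, and it does not restate column attributes such as NOT NULL or defaults, which a real conversion script would have to carry over:

```python
def blob_roundtrip_statements(database, table, columns):
    """Build the ALTER statements for one table, in the order of the example steps.

    `columns` maps a column name to its text type, e.g. {"Last_Name": "TEXT"}.
    """
    # Step 1: turn every text column into a BLOB so its bytes are left alone.
    stmts = [f"ALTER TABLE {table} MODIFY {col} BLOB;" for col in columns]
    # Steps 2 and 3: switch the database and table character sets.
    stmts.append(f"ALTER DATABASE {database} charset=utf8;")
    stmts.append(f"ALTER TABLE {table} charset=utf8;")
    # Step 4: restore the original text types, now declared as utf8.
    stmts += [
        f"ALTER TABLE {table} MODIFY {col} {sqltype} CHARACTER SET utf8;"
        for col, sqltype in columns.items()
    ]
    return stmts

for stmt in blob_roundtrip_statements("wordpress", "users", {"Last_Name": "TEXT"}):
    print(stmt)
```

With the single column from the example, this prints the same four statements as the steps listed above.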
The key here is that a BLOB field will not be converted to garbage when
switched to UTF-8, unlike a TEXT field.
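The difference between the two routes can be sketched at the byte level in Python. It is assumed here that a TEXT conversion transcodes the bytes from Latin-1 to UTF-8, while a BLOB passes them through untouched; the variable names are illustrative only:

```python
original = "é"
stored = original.encode("utf-8")   # bytes WP wrote into the Latin column

# TEXT route: the bytes are transcoded from Latin-1 to UTF-8.
via_text = stored.decode("latin-1").encode("utf-8")

# BLOB route: the bytes pass through unchanged and are only relabelled as utf8.
via_blob = bytes(stored)

print(via_text.decode("utf-8") == original)   # False
print(via_blob.decode("utf-8") == original)   # True
```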
Hopefully, the developers of WP will be able to create a conversion script
to upgrade old Latin databases.
--
Ticket URL: <http://trac.wordpress.org/ticket/3517>
WordPress Trac <http://trac.wordpress.org/>
WordPress blogging software