[wp-trac] [WordPress Trac] #32165: wp-db.php destructs all the multibyte characters
WordPress Trac
noreply at wordpress.org
Tue Apr 28 15:01:51 UTC 2015
#32165: wp-db.php destructs all the multibyte characters
--------------------------+-----------------------------
 Reporter:  kjmtsh        |      Owner:
     Type:  defect (bug)  |     Status:  new
 Priority:  normal        |  Milestone:  Awaiting Review
Component:  Database      |    Version:  4.2.1
 Severity:  normal        |   Keywords:
  Focuses:                |
--------------------------+-----------------------------
Many users in Japan have reported losing post data because of a
malfunction in WordPress 4.1.2 and later (including 4.1.3, 4.2 and 4.2.1).
Some of them are being forced to hold off on upgrading WordPress.
One of the reporters told me their MySQL settings. I reproduced the
environment and confirmed the cause, which could destroy data in
environments other than Japanese as well. I'd like you to fix it, and I
strongly recommend releasing a patched version as soon as possible.
This is what's really going on. As you know, WordPress works in a UTF-8
environment. It receives text input, such as posts and comments, encoded
in UTF-8. Our reporters' sites are the same in that respect. But their
MySQL settings are not all UTF-8. Theirs look like this:
MySQL server settings:
* character_set_client: utf8
* character_set_connection: utf8
* character_set_server: utf8
MySQL database and table settings:
* WordPress default database character set: ujis
* WordPress tables character set: ujis
* Collation of both: ujis_japanese_ci
This environment appears to have grown up historically. That is to say,
they built their sites when MySQL didn't satisfactorily support utf8. They
adopted ujis as the database character set, which was the only option
available. Through repeated updates, MySQL has become a modern database
and now fully supports utf8. Most of the hosting companies upgraded MySQL
with a newly written my.cnf. Server settings were changed, but the
long-standing users' databases and tables, filled with data, were left
untouched.
Under these conditions, wp-db.php in the newly released WordPress
manipulates text data as follows:
If the character set of the table is ujis, it calls mb_convert_encoding()
with this character set. This is the line.
{{{
$value['value'] = mb_convert_encoding($value['value'], $mb_charsets[$charset], $mb_charsets[$charset]);
}}}
This turns users' post data into unrecoverable garbage.
{{{$mb_charsets[$charset]}}} comes from the database settings, but
{{{$value['value']}}} is encoded in UTF-8. Applied to a UTF-8 encoded
string, this line does the same thing as the following.
{{{
$value['value'] = mb_convert_encoding($value['value'], 'EUC-JP', 'EUC-JP');
}}}
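To make the effect concrete, here is a minimal standalone sketch (not
WordPress code) that runs a UTF-8 string through the same kind of call. It
assumes the mbstring extension is loaded. The sample text and the variable
names {{{$utf8}}}, {{{$mangled}}} and {{{$recovered}}} are only for
illustration, and the exact garbage produced may differ between PHP
versions.
{{{
<?php
// Sketch only; needs the mbstring extension.
// "abc " followed by three kanji, written as raw UTF-8 bytes so that the
// encoding of this file itself doesn't matter.
$utf8 = "abc \xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// Same shape as the call above: the UTF-8 bytes are read as if they were
// EUC-JP and written out as EUC-JP again.
$mangled = mb_convert_encoding($utf8, 'EUC-JP', 'EUC-JP');

// Try to get the text back by treating the result as EUC-JP.
$recovered = mb_convert_encoding($mangled, 'UTF-8', 'EUC-JP');

var_dump(substr($mangled, 0, 4) === 'abc '); // is the ASCII part untouched?
var_dump($mangled === $utf8);                // did the multibyte part survive?
var_dump($recovered === $utf8);              // can the original be recovered?
}}}
The ASCII prefix passes through, but the multibyte bytes are not valid
EUC-JP sequences, so they are replaced and the original text cannot be
restored from the result.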
As a result, all non-ASCII characters suffer irreversible conversion. I
don't think this is what you intended. You might think it would be better
to change the line to this.
{{{
$value['value'] = mb_convert_encoding($value['value'], 'EUC-JP', 'UTF-8');
}}}
or
{{{
$value['value'] = mb_convert_encoding($value['value'], 'EUC-JP');
}}}
That looks better. But NO. These changes don't work and cause the same
destruction. The reason is that the connection between WordPress and
MySQL uses the UTF-8 character set, so the input/output stream is encoded
in UTF-8. In this case, WordPress forces EUC-JP encoded data into the
UTF-8 stream, and MySQL can't interpret that data correctly (it changes
all the multibyte characters to '?').
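A small sketch of why the "corrected" conversion still fails, again as
standalone PHP and assuming mbstring is available. The conversion to
EUC-JP itself is lossless, but the resulting bytes are not valid UTF-8,
which is what a utf8 connection expects. The MySQL side is not simulated
here; the variable names are only for illustration.
{{{
<?php
// Sketch only; needs the mbstring extension.
// Three kanji written as raw UTF-8 bytes.
$utf8 = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// The "fixed" call: a proper UTF-8 -> EUC-JP conversion.
$eucjp = mb_convert_encoding($utf8, 'EUC-JP', 'UTF-8');

// Converting back restores the text, so nothing is lost at this point.
$roundtrip = mb_convert_encoding($eucjp, 'UTF-8', 'EUC-JP');
var_dump($roundtrip === $utf8);

// But the EUC-JP bytes are not a valid UTF-8 string, so sending them over
// a connection declared as utf8 hands the server malformed input.
$valid_as_utf8 = mb_check_encoding($eucjp, 'UTF-8');
var_dump($valid_as_utf8);
}}}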
If the database/table character set differs from the one used for the
server-client connection, you have to use the latter, which is what the
TCP/IP or UNIX socket connection expects. Otherwise, you put data
integrity in danger and prevent the MySQL database engine from converting
the data appropriately.
So here is my patch.
{{{
--- wp-db_orig.php 2015-04-28 12:16:52.037000000 +0900
+++ wp-db.php 2015-04-28 22:13:20.465000000 +0900
@@ -2590,7 +2590,11 @@
$db_check_string = false;
foreach ( $data as &$value ) {
- $charset = $value['charset'];
+ if ($value['charset'] !== $this->charset) {
+ $charset = $this->charset;
+ } else {
+ $charset = $value['charset'];
+ }
// Column isn't a string, or is latin1, which will will
happily store anything.
if ( false === $charset || 'latin1' === $charset ) {
}}}
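For readers who want to try the selection logic outside of wp-db.php, the
branch added by the patch can be isolated as below. This is only a sketch:
the function name {{{patched_charset()}}} and its parameters are
hypothetical and do not exist in WordPress. {{{$connection_charset}}}
stands in for {{{$this->charset}}}.
{{{
<?php
/**
 * Hypothetical helper mirroring the patched branch; not part of wpdb.
 *
 * @param string|false $column_charset     Charset recorded for the column.
 * @param string       $connection_charset Charset of the client connection.
 * @return string|false Charset the value would be checked against.
 */
function patched_charset($column_charset, $connection_charset) {
    if ($column_charset !== $connection_charset) {
        // Column and connection disagree: trust the connection charset.
        return $connection_charset;
    }
    return $column_charset;
}

var_dump(patched_charset('ujis', 'utf8')); // ujis table, utf8 connection
var_dump(patched_charset('utf8', 'utf8')); // both utf8
var_dump(patched_charset(false, 'utf8'));  // non-string column
}}}
All three calls return 'utf8'. That includes the {{{false}}} case, so
with this logic the {{{false === $charset}}} and {{{'latin1' ===
$charset}}} checks that follow in wp-db.php would no longer match when the
connection is utf8. Whether that matters may deserve a separate look.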
--
Ticket URL: <https://core.trac.wordpress.org/ticket/32165>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform