[wp-trac] [WordPress Trac] #32165: wp-db.php destructs all the multibyte characters

WordPress Trac noreply at wordpress.org
Tue Apr 28 15:01:51 UTC 2015

#32165: wp-db.php destructs all the multibyte characters
 Reporter:  kjmtsh        |      Owner:
     Type:  defect (bug)  |     Status:  new
 Priority:  normal        |  Milestone:  Awaiting Review
Component:  Database      |    Version:  4.2.1
 Severity:  normal        |   Keywords:
  Focuses:                |
 Many users in Japan have reported losing post data because of a
 malfunction in WordPress 4.1.2 and later (including 4.1.3, 4.2 and 4.2.1).
 Some of them have had to hold off on upgrading WordPress.

 One of the reporters gave me their MySQL settings. I reproduced the
 environment and confirmed the cause, which could destroy data in
 non-Japanese environments as well. I'd like you to fix it, and I strongly
 recommend releasing a patched version as soon as possible.

 This is what's really going on. As you know, WordPress works in a UTF-8
 environment. It receives text input such as posts and comments encoded in
 UTF-8. Our reporters' sites are the same. But their MySQL settings are not
 uniformly UTF-8. Theirs are as follows:

 MySQL server settings:

 * character_set_client: utf8
 * character_set_connection: utf8
 * character_set_server: utf8

 MySQL database and table settings:

 * WordPress default database character set: ujis
 * WordPress tables character set: ujis
 * Collation of both: ujis_japanese_ci

 This environment appears to have grown historically. That is to say, they
 built their sites when MySQL did not yet support utf8 satisfactorily. They
 adopted ujis as the database character set, which was the only option
 available. Through repeated updates, MySQL has become a modern database
 and now fully supports utf8. Most hosting companies updated MySQL with a
 newly written my.cnf. The server settings changed, but the veterans' old
 databases and tables, already filled with data, were left untouched.

 Under these conditions, wp-db.php in the newly released WordPress
 manipulates text data as follows:

 If the character set of the table is ujis, it calls mb_convert_encoding()
 with this character set. This is the line.

 $value['value'] = mb_convert_encoding($value['value'],
 $mb_charsets[$charset], $mb_charsets[$charset]);

 This turns users' post data into unrecoverable garbage.
 {{{$mb_charsets[$charset]}}} comes from the database settings, but
 {{{$value['value']}}} is encoded in UTF-8. For a UTF-8 encoded string, this
 line does the same thing as the following.

 $value['value'] = mb_convert_encoding($value['value'], 'EUC-JP', 'EUC-JP');
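
 The effect of that call can be reproduced outside WordPress. The following
 is a minimal sketch, assuming PHP with the mbstring extension loaded; the
 sample string is arbitrary, and the exact garbage produced depends on the
 mbstring substitute-character setting.

```php
<?php
// Sketch only: reproduces the same-charset conversion on a UTF-8 string.
// Assumes the mbstring extension is loaded; the sample text is arbitrary.

$utf8 = "日本語 abc"; // UTF-8 bytes, as WordPress receives them

// Treat the UTF-8 bytes as if they were EUC-JP and "convert" EUC-JP to EUC-JP.
// Byte sequences that are not valid EUC-JP are dropped or substituted.
$mangled = mb_convert_encoding($utf8, 'EUC-JP', 'EUC-JP');

// The multibyte part no longer matches the original bytes.
var_dump($mangled === $utf8); // bool(false)

// The trailing ASCII part can be compared separately.
var_dump(substr($mangled, -4) === ' abc');
```

 Nothing in $mangled records which bytes were replaced, so the original
 text cannot be rebuilt from it.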

 As a result, every non-ASCII character undergoes an irreversible
 conversion. I think this is not what you want to do. You might think it
 would be better to change the line to one of these.

 $value['value'] = mb_convert_encoding($value['value'], 'EUC-JP', 'UTF-8');

 or

 $value['value'] = mb_convert_encoding($value['value'], 'EUC-JP');

 That appears to be better. But no. These changes don't work and cause the
 same destruction. The reason is that the connection between WordPress and
 MySQL uses the UTF-8 character set, so the input/output stream is encoded
 in UTF-8. In this case WordPress forces EUC-JP encoded data into the UTF-8
 stream, and MySQL can't interpret that data correctly (it changes all the
 multibyte characters to '?').
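
 To see why converting to EUC-JP before sending cannot work on a utf8
 connection, the bytes can be inspected directly. This is a minimal sketch,
 assuming PHP with mbstring; it does not talk to MySQL, it only checks
 whether the converted bytes are valid UTF-8, which is what the connection
 in this report is declared to carry.

```php
<?php
// Sketch only: no database involved, sample text is arbitrary.
$utf8  = "日本語";
$eucjp = mb_convert_encoding($utf8, 'EUC-JP', 'UTF-8');

// The converted bytes are valid EUC-JP but not valid UTF-8, so a server
// reading the stream as utf8 has no way to interpret them correctly.
var_dump(mb_check_encoding($eucjp, 'EUC-JP')); // bool(true)
var_dump(mb_check_encoding($eucjp, 'UTF-8'));  // bool(false)

// The conversion itself is lossless when both sides agree on the charset.
var_dump(mb_convert_encoding($eucjp, 'UTF-8', 'EUC-JP') === $utf8); // bool(true)
```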

 If the database/table character set differs from the one used for the
 server-client connection, you have to use the latter, because that is the
 encoding of the data carried over the TCP/IP or UNIX socket connection.
 Otherwise you put data integrity in danger and prevent the MySQL database
 engine from converting the data appropriately.
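
 The rule above can be sketched as a small helper. This is an illustration,
 not WordPress code: the function name pick_check_charset() and the
 $mb_names map are hypothetical, and the map lists only the two character
 sets from this report.

```php
<?php
// Hypothetical helper: choose the charset that text should be validated
// against before it is sent to MySQL.
function pick_check_charset($column_charset, $connection_charset) {
    // When the column and the connection disagree, the connection charset
    // wins: the bytes on the wire are in that charset, and MySQL converts
    // them to the column charset itself.
    return ($column_charset !== $connection_charset)
        ? $connection_charset
        : $column_charset;
}

// Illustrative MySQL-to-mbstring name map (assumption, only two entries).
$mb_names = array('utf8' => 'UTF-8', 'ujis' => 'EUC-JP');

$charset = pick_check_charset('ujis', 'utf8');
$text    = "日本語";

// Validating UTF-8 input against the connection charset leaves it intact.
$checked = mb_convert_encoding($text, $mb_names[$charset], $mb_names[$charset]);
var_dump($charset);           // string(4) "utf8"
var_dump($checked === $text); // bool(true)
```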

 So here is my patch.

 --- wp-db_orig.php      2015-04-28 12:16:52.037000000 +0900
 +++ wp-db.php   2015-04-28 22:13:20.465000000 +0900
 @@ -2590,7 +2590,11 @@
         $db_check_string = false;

         foreach ( $data as &$value ) {
 -            $charset = $value['charset'];
 +            if ($value['charset'] !== $this->charset) {
 +                $charset = $this->charset;
 +            } else {
 +                $charset = $value['charset'];
 +            }

              // Column isn't a string, or is latin1, which will will happily store anything.
              if ( false === $charset || 'latin1' === $charset ) {

Ticket URL: <https://core.trac.wordpress.org/ticket/32165>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform
