[wp-trac] [WordPress Trac] #32165: wp-db.php destructs all the multibyte characters

WordPress Trac noreply at wordpress.org
Wed Apr 29 14:25:13 UTC 2015

#32165: wp-db.php destructs all the multibyte characters
 Reporter:  kjmtsh        |       Owner:
     Type:  defect (bug)  |      Status:  new
 Priority:  high          |   Milestone:  4.2.2
Component:  Database      |     Version:  4.1.2
 Severity:  blocker       |  Resolution:
 Keywords:                |     Focuses:

Comment (by kjmtsh):

 @azaozz Thank you for commenting.

 To test this case, you have to create a new database whose character set
 is different from the MySQL server setting. Simply converting a utf8 table
 to the ujis character set may not work and may destroy the stored data.

 If a user's database connection character set is different from the
 database or table character set, we have no choice about which one to use.
 The data flow looks like this:

 input:  WordPress -(utf8)-> database engine -(ujis)-> each table
 output: each table -(ujis)-> database engine -(utf8)-> WordPress

 If the constant DB_CHARSET defined in wp-config.php is set to utf8, the
 connection from WordPress to the database engine is encoded in UTF-8. But
 when the database engine writes to the table, it uses the table character
 set (ujis in our case). WordPress must not give an EUC-JP encoded string to
 the database engine. If it does, the user's data is destroyed at this stage.
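
 As a minimal sketch of why EUC-JP bytes must not travel over a utf8
 connection, the following only simulates the two encodings with mbstring and
 does not touch a database. It assumes the mbstring extension is loaded and
 that the file is saved as UTF-8; the variable names are illustrative.

```php
<?php
// What WordPress holds in memory (UTF-8).
$utf8 = "日本語";
// What a ujis table stores after the engine converts it (EUC-JP).
$eucjp = mb_convert_encoding($utf8, 'EUC-JP', 'UTF-8');

// A utf8 connection expects valid UTF-8 byte sequences.
var_dump(mb_check_encoding($utf8, 'UTF-8'));  // bool(true)
var_dump(mb_check_encoding($eucjp, 'UTF-8')); // bool(false)
```

 The EUC-JP bytes are not valid UTF-8, so the engine cannot interpret them
 correctly when the connection character set is utf8.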

 Output is not a problem. But there may be a problem with input. UTF-8 has
 many more characters than EUC-JP. So if the input data contains characters
 incompatible with EUC-JP, MySQL changes those characters to '?'
 (question mark). We can do nothing about that.

 {{{strip_invalid_text()}}} tries to remove the characters that are
 incompatible with the table/column character set. It succeeds if and only
 if the database connection character set and the table/column character
 set are the same. Otherwise it fails. In this situation, we can only use
 the database connection character set. We must not use the table/column
 character set.

 What if we convert the data from UTF-8 to EUC-JP, and then convert it back
 from EUC-JP to UTF-8? No. For example, take some UTF-8 encoded Korean
 characters that are incompatible with the EUC-JP Japanese character set.

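 // Hangul has no mapping in EUC-JP, so each character becomes '?'.
 $input = mb_convert_encoding("한글", 'EUC-JP', 'UTF-8');
 // Converting back cannot restore what was lost.
 $input = mb_convert_encoding($input, 'UTF-8', 'EUC-JP');
 echo $input;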

 We only get '??' (two question marks). The first conversion changes the
 characters to '??' and the second one does nothing. What is worse, the two
 Korean characters are replaced, not removed. This is not the result that
 {{{strip_invalid_text()}}} tries to get. I don't think it's a good idea to
 use mb_convert_encoding() here, because this function is just for
 conversion, not for removal of characters.
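
 A round trip can still be used to detect, rather than remove, characters
 that the target character set cannot represent. The sketch below assumes
 the mbstring extension and a UTF-8 source file; survives_round_trip() is a
 hypothetical helper for illustration and is not part of wp-db.php.

```php
<?php
// Hypothetical helper: true when $text is unchanged after being converted
// to $charset and back to UTF-8.
function survives_round_trip($text, $charset) {
    $encoded = mb_convert_encoding($text, $charset, 'UTF-8');
    $decoded = mb_convert_encoding($encoded, 'UTF-8', $charset);
    return $decoded === $text;
}

var_dump(survives_round_trip("日本語", 'EUC-JP')); // bool(true)
var_dump(survives_round_trip("한글", 'EUC-JP'));   // bool(false)
```

 This only reports that something would be lost. It does not tell which
 characters to remove, so it is not a replacement for the stripping logic.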

 We can't have users set DB_CHARSET to ujis, because their MySQL server
 settings require utf8 (and many of them don't have the privilege to change
 server settings, either).

 Possible solutions:

 1. Under such conditions, we give up using the {{{strip_invalid_text()}}}
 function. My patch does this.
 2. Remove local conversion methods from wp-db.php.
 3. Do nothing.
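
 The idea behind solution 1 can be sketched as a simple guard. This is not
 the actual patch: can_strip_safely() and its parameters are hypothetical
 names, and comparing the two character set names directly is an assumption
 made only for illustration.

```php
<?php
// Hypothetical guard, not taken from wp-db.php or from the patch.
function can_strip_safely($connection_charset, $column_charset) {
    // Stripping is only trusted when both sides use the same character set.
    return strcasecmp($connection_charset, $column_charset) === 0;
}

var_dump(can_strip_safely('utf8', 'utf8')); // bool(true)
var_dump(can_strip_safely('utf8', 'ujis')); // bool(false)
```

 When the guard returns false, the caller would skip the stripping step and
 pass the data to the database engine unchanged.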

 Possible problems:

 1. Even if we don't use {{{strip_invalid_text()}}}, there remains the
 danger that characters incompatible with EUC-JP will be sent to MySQL.
 Most Japanese-language users probably use only Japanese. This is why they
 have had few problems until now. The change in wp-db.php only turns the
 potential problem into a real one.[[BR]]It appears to me that if we want
 to avoid the possible problem, we have no option except to recommend that
 such users convert their database/table/column character set to utf8. I
 know this operation is not easy, especially for non-programmers or
 programmers not acquainted with MySQL.[[BR]]I think some day WordPress
 will inevitably stop supporting character sets other than UTF-8. Is now
 the time? Or is it the beginning of the end?

 2. Even if a user's database character set is utf8,
 {{{strip_invalid_text()}}} has another possible problem that I didn't
 mention in the above ticket. It checks for UTF-8 characters with a PHP
 regular expression, and posts, comments, pingbacks, etc. all go through
 this function. In this case, what if WordPress gets a pingback in some
 European language (like German or French) encoded in ISO-8859-1? A German
 umlaut or a French accent aigu is silently removed. This will always
 happen when WordPress gets non-UTF-8 encoded characters.[[BR]]This is the
 same as saying WordPress doesn't support any character set other than utf8.

 3. If we take this solution, we have to let all users know that WordPress
 will only support the UTF-8 character set for the database, because
 otherwise we can't ensure the safety of the input data. This is the
 easiest way. We don't have to care about other character sets. But it
 puts users in a difficult situation.
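
 Problem 2 above can be reproduced in a few lines. The regular expression
 here is a simplified stand-in that keeps only well-formed UTF-8 sequences;
 it is not the actual expression used in wp-db.php. The sketch assumes the
 mbstring extension, and that the sender's encoding (ISO-8859-1) is known.

```php
<?php
// "München" encoded in ISO-8859-1: the umlaut is the single byte 0xFC.
$latin1 = "M\xFCnchen";

// Simplified pattern for well-formed UTF-8 sequences (illustrative only).
$valid_utf8 = '/[\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]'
    . '|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
    . '|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}'
    . '|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2}/';

// Keep only the byte sequences that are valid UTF-8.
preg_match_all($valid_utf8, $latin1, $matches);
$stripped = implode('', $matches[0]);
echo $stripped, "\n"; // Mnchen

// Converting first keeps the umlaut, but only if the source encoding is known.
$converted = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
echo $converted, "\n"; // München
```

 The byte 0xFC is not part of any valid UTF-8 sequence, so a UTF-8-only
 filter drops it without any warning.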

 I recommend taking solution 1 for the time being, and starting to prepare
 to stop supporting any character set other than utf8.

 Thank you.

Ticket URL: <https://core.trac.wordpress.org/ticket/32165#comment:7>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform
