[wp-trac] [WordPress Trac] #32165: wp-db.php destructs all the multibyte characters
WordPress Trac
noreply at wordpress.org
Wed Apr 29 14:25:13 UTC 2015
#32165: wp-db.php destructs all the multibyte characters
--------------------------+--------------------
Reporter: kjmtsh | Owner:
Type: defect (bug) | Status: new
Priority: high | Milestone: 4.2.2
Component: Database | Version: 4.1.2
Severity: blocker | Resolution:
Keywords: | Focuses:
--------------------------+--------------------
Comment (by kjmtsh):
@azaozz Thank you for commenting.
To test this case, you have to create a new database whose character set
is different from the MySQL server settings. Simply converting a utf8
table to the ujis character set may not work and may destroy the stored data.
If a user's database connection character set is different from the database or
table character set, we have no choice about which one to use. The data flow is as
follows.
{{{
input: WordPress -(utf8)-> database engine -(ujis)-> each table
output: each table -(ujis)-> database engine -(utf8)-> WordPress
}}}
If the constant DB_CHARSET defined in wp-config.php is set to utf8, the
connection from WordPress to the database engine is encoded in UTF-8. But when
the database engine writes to the table, it uses the table character set (ujis in
our case). WordPress must not give an EUC-JP encoded string to the database
engine. When this happens, users' data will be destroyed at this stage.
Output is no problem, but there might be a problem with input. UTF-8 has
many more characters than EUC-JP, so if the input data contains
characters incompatible with EUC-JP, MySQL changes those characters to '?'
(question mark). We can do nothing about that.
{{{strip_invalid_text()}}} tries to remove the characters that are incompatible
with the table/column character set. It succeeds if and only if the
database connection character set and the table/column character set are the
same; otherwise it fails. In this situation, we can only use the database
connection character set. We must not use the table/column character set.
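The mismatch can be illustrated with a minimal sketch. It assumes the mbstring extension is available and that MySQL's utf8 and ujis correspond to mbstring's UTF-8 and EUC-JP. The variable names are only for illustration and are not taken from wp-db.php.

```php
<?php
// "日本語" written with escapes so the bytes are UTF-8 regardless of how
// this file is saved. This is what arrives when DB_CHARSET is utf8.
$text = "\u{65E5}\u{672C}\u{8A9E}";

// Assumed mapping from MySQL charset names to mbstring encoding names.
$mb_names = array('utf8' => 'UTF-8', 'ujis' => 'EUC-JP');

// Validate against the connection character set.
$valid_for_connection = mb_check_encoding($text, $mb_names['utf8']);

// Validate the same bytes against the table character set.
$valid_for_table = mb_check_encoding($text, $mb_names['ujis']);

var_dump($valid_for_connection);
var_dump($valid_for_table);
```

The bytes are well-formed UTF-8 but not well-formed EUC-JP, so a check against the table character set would treat perfectly good Japanese text as invalid.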
What if we convert the data from UTF-8 to EUC-JP and then convert it back from
EUC-JP to UTF-8? That doesn't work. For example, take some UTF-8 encoded Korean
characters that are incompatible with the EUC-JP Japanese character set.
{{{
$input = mb_convert_encoding("한글", 'EUC-JP', 'UTF-8');
$input = mb_convert_encoding($input, 'UTF-8', 'EUC-JP');
echo $input;
}}}
We only get '??' (two question marks). The first conversion changes the
characters to '??' and the second one does nothing. What is worse, the two
Korean characters are not removed. This is not the result this function
is trying to get. I don't think it's a good idea to use
mb_convert_encoding() here, because that function is just for conversion,
not for removal of characters.
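If one only wanted to know which characters would be lost, instead of having them silently turned into '?', a per-character round trip is one possible approach. The following is a sketch under the assumption that mbstring is available; {{{find_unrepresentable_chars()}}} is a hypothetical helper and is not part of WordPress or of my patch.

```php
<?php
// Hypothetical helper for illustration; it is not part of WordPress.
// Returns the characters of $text (UTF-8) that do not survive a round
// trip through $target_encoding.
function find_unrepresentable_chars($text, $target_encoding)
{
    $lost = array();
    // Split the UTF-8 string into single characters.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($chars as $char) {
        $converted = mb_convert_encoding($char, $target_encoding, 'UTF-8');
        $restored = mb_convert_encoding($converted, 'UTF-8', $target_encoding);
        // A character that comes back changed cannot be stored losslessly.
        if ($restored !== $char) {
            $lost[] = $char;
        }
    }
    return $lost;
}

// "日本語한글": three Japanese characters followed by two Korean ones.
$input = "\u{65E5}\u{672C}\u{8A9E}\u{D55C}\u{AE00}";
$lost = find_unrepresentable_chars($input, 'EUC-JP');
echo count($lost), "\n";
```

This only reports the characters. Whether to remove them, reject the input, or warn the user is a separate decision.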
We can't have users set DB_CHARSET to ujis, because the MySQL server settings
require utf8 (many of them don't have the privilege to change server
settings, either).
Possible solutions:
1. Under such conditions, we give up using the {{{strip_invalid_text()}}}
function. My patch does this.
2. Remove local conversion methods from wp-db.php.
3. Do nothing.
Possible problems:
1. Even if we don't use {{{strip_invalid_text()}}}, there remains a possible
danger that characters incompatible with EUC-JP will be sent to MySQL.
Most Japanese-language users probably use only Japanese. This is
why they have had few problems until now. The change in wp-db.php only turns
the potential problem into a real one.[[BR]]It appears to me that if we want
to avoid the possible problem, we have no option except to recommend that such
users convert their database/table/column character set to utf8. I know
this operation is not easy, especially for non-programmers or programmers
not acquainted with MySQL.[[BR]]I think some day WordPress will inevitably
stop supporting character sets other than UTF-8. Is now the time? Or is it
the beginning of the end?
2. Even if the users' database character set is utf8,
{{{strip_invalid_text()}}} has another possible problem I didn't mention
in the above ticket. It checks UTF-8 characters with a PHP regular
expression, which means that posts, comments, pingbacks, etc. go through this
function. In this case, what if WordPress gets a pingback in some European
language (like German or French) encoded in ISO-8859-1? A German umlaut or
French acute accent is silently removed. This will always happen when
WordPress gets non-UTF-8 encoded characters.[[BR]]This is the same as
saying WordPress doesn't support any character set other than utf8.
3. If we take this solution, we have to let all users know
that WordPress will only support the UTF-8 character set for the database,
because otherwise we can't ensure the safety of the input data. This is
the easiest way. We don't have to care about other character sets. But
users will be in a difficult situation.
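The pingback case in problem 2 can be sketched as follows. This is only an illustration, assuming mbstring is available and that the sender's character set is known (here it is hard-coded as ISO-8859-1). The variable names are hypothetical and not taken from WordPress.

```php
<?php
// "Grüße" as ISO-8859-1 bytes: 0xFC is u-umlaut, 0xDF is sharp s.
$pingback = "Gr\xFC\xDFe";

// The raw bytes are not well-formed UTF-8, so a UTF-8-only check rejects them.
$is_utf8 = mb_check_encoding($pingback, 'UTF-8');

// Assumption: the sender declared ISO-8859-1, so the source charset is known.
$declared_charset = 'ISO-8859-1';
$as_utf8 = mb_convert_encoding($pingback, 'UTF-8', $declared_charset);

var_dump($is_utf8);
var_dump($as_utf8 === "Gr\u{FC}\u{DF}e");
```

Converting from the declared character set before any UTF-8 check keeps the umlaut and the sharp s. A check applied to the raw bytes would see them as invalid UTF-8.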
I recommend taking solution number 1 for the time being, and starting to
prepare to stop supporting any character set other than utf8.
Thank you.
--
Ticket URL: <https://core.trac.wordpress.org/ticket/32165#comment:7>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform