[wp-trac] [WordPress Trac] #36610: Loss of multibyte category and tag names
WordPress Trac
noreply at wordpress.org
Wed Apr 20 22:40:42 UTC 2016
#36610: Loss of multibyte category and tag names
--------------------------+-----------------------------
 Reporter:  cfinke        |      Owner:
     Type:  defect (bug)  |     Status:  new
 Priority:  normal        |  Milestone:  Awaiting Review
Component:  Taxonomy      |    Version:  trunk
 Severity:  normal        |   Keywords:
  Focuses:                |
--------------------------+-----------------------------
Some multibyte category and tag names can be lost during creation.
Example: create a category with the name `テテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテ
テテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテAAA`. It is 201 bytes long and will be
truncated by `$wpdb->strip_invalid_text_for_column()` to 200 bytes (`テテテテテ
テテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテAA`) before
the category is created.
However, the category name `AAAテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテ
テテテテテテテテテテテテテテテテテテテテテテテ` is also 201 bytes, but when it is truncated to
200 bytes, the cut falls inside a multibyte character. When
`wp_check_invalid_utf8()` is then called, it returns an empty string out
of an abundance of caution, because the string now ends with an incomplete
byte sequence that is not valid UTF-8, and the whole name is lost.
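The byte arithmetic behind the two examples can be checked with a short
sketch. It is written in Python purely for illustration and is not
WordPress code; `is_valid_utf8` is a hypothetical helper standing in for
the validity check, and the 200-byte cut simply mirrors the truncation
described above.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical stand-in)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


KATAKANA_TE = "\u30c6"  # "テ", which occupies three bytes in UTF-8

# 66 three-byte characters plus three ASCII letters: 66 * 3 + 3 = 201 bytes.
trailing_ascii = (KATAKANA_TE * 66 + "AAA").encode("utf-8")
leading_ascii = ("AAA" + KATAKANA_TE * 66).encode("utf-8")

# Plain byte truncation to the 200-byte limit described in the ticket.
cut_trailing = trailing_ascii[:200]
cut_leading = leading_ascii[:200]

print(len(trailing_ascii), len(leading_ascii))  # both names are 201 bytes
print(is_valid_utf8(cut_trailing))  # the cut only removes one ASCII "A"
print(is_valid_utf8(cut_leading))   # the cut lands inside the final "テ"
```

The first truncated name is still valid UTF-8, while the second ends with
two bytes of a three-byte character and fails the check.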
It's clear that the category creator was not submitting invalid UTF-8. The
real purpose of `$wpdb->strip_invalid_text_for_column()` is to ensure that
the text fits in the DB column without auto-truncation by the DB engine,
so the ideal behavior is to truncate the string to the longest prefix that
is still valid UTF-8 and fits within the column.
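One way to picture that ideal behavior is a truncation that backs up to
the nearest character boundary. The sketch below is in Python for
illustration only; `truncate_utf8` is a hypothetical name, and the
function assumes its input is already valid UTF-8.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Return the longest prefix of `data` that fits in `max_bytes`
    without ending inside a multibyte character.

    Assumes `data` is valid UTF-8 to begin with.
    """
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # data[end] is the first byte that would be dropped. If it is a
    # continuation byte (bit pattern 10xxxxxx), the cut sits inside a
    # character, so move back until it sits on a character boundary.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


name = ("AAA" + "\u30c6" * 66).encode("utf-8")  # 201 bytes
short = truncate_utf8(name, 200)
print(len(short))             # 198
print(short.decode("utf-8"))  # "AAA" followed by 65 "テ"
```

Here the name shrinks to 198 bytes rather than to nothing: the whole final
character is dropped instead of being split.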
One way to get around this data loss would be a wrapper around
`wp_check_invalid_utf8()`. If `wp_check_invalid_utf8()` fails, chop a
single byte off the end of the string and check it again, giving up once
the string has been checked without its last five bytes. Five is a
conservative bound: the original UTF-8 design allowed sequences of up to
six bytes, but RFC 3629 limits a character to four, so in practice
trimming at most three bytes is enough. Or, fix
`$wpdb->strip_invalid_text_for_column()` so that it doesn't truncate in
the middle of a multibyte character.
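The wrapper idea can be sketched as a retry loop. This is a Python
illustration under stated assumptions, not WordPress code:
`is_valid_utf8` and `check_with_retry` are hypothetical names, and the
default of three trimmed bytes is an assumption based on UTF-8's four-byte
maximum (the more conservative five suggested in the ticket would also
work).

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical stand-in)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def check_with_retry(data: bytes, max_trim: int = 3) -> bytes:
    """Return `data` if it is valid UTF-8. Otherwise drop trailing bytes
    one at a time, up to `max_trim`, and return the first valid result.
    Return b"" if no candidate is valid.
    """
    for trimmed in range(min(max_trim, len(data)) + 1):
        candidate = data[: len(data) - trimmed]
        if is_valid_utf8(candidate):
            return candidate
    return b""


# A 201-byte name cut to 200 bytes ends with a partial character.
broken = ("AAA" + "\u30c6" * 66).encode("utf-8")[:200]
print(len(check_with_retry(broken)))  # 198
```

A string that is invalid for some other reason, such as stray bytes in the
middle, still comes back empty, so the cautious behavior is kept for input
that really is malformed.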
There might also be a solution lurking in `mb_strlen()`. If
`wp_check_invalid_utf8()` returns an empty string, take bytes off the end
of the original string (up to 5 bytes) until `mb_strlen()` returns a
smaller number, and then try `wp_check_invalid_utf8()` again.
Configuration details: tested in WordPress trunk (4.5-RC1-37153) on PHP
5.2.17.
Here's my `wp_terms` structure:
{{{
CREATE TABLE `wp_terms` (
`term_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(200) NOT NULL DEFAULT '',
`slug` varchar(200) NOT NULL DEFAULT '',
`term_group` bigint(10) NOT NULL DEFAULT '0',
PRIMARY KEY (`term_id`),
KEY `slug` (`slug`(191)),
KEY `name` (`name`(191))
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
}}}
See #36393 for discussion of a similar (but now-fixed) bug.
--
Ticket URL: <https://core.trac.wordpress.org/ticket/36610>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform