[wp-trac] [WordPress Trac] #36393: Loss of multibyte comment author names
WordPress Trac
noreply at wordpress.org
Fri Apr 1 07:04:35 UTC 2016
#36393: Loss of multibyte comment author names
--------------------------+-----------------------------
 Reporter:  cfinke        |      Owner:
     Type:  defect (bug)  |     Status:  new
 Priority:  normal        |  Milestone:  Awaiting Review
Component:  Comments      |    Version:  trunk
 Severity:  normal        |   Keywords:
  Focuses:                |
--------------------------+-----------------------------
Some multibyte comment author names can be lost during comment submission.
Example: consider a comment authored by a user named `テテテテテテテテテテテテテテテテテテテテ
テテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテ`. This
name is a 258-byte string, longer than the maximum length of the
`comment_author` column. `$wpdb->strip_invalid_text_for_column()` will
truncate it to 255 bytes, and because each character is three bytes, the
string is still "valid," albeit one character shorter.
After `$wpdb->strip_invalid_text_for_column()` runs,
`sanitize_text_field()` will run, which calls `wp_check_invalid_utf8()`,
which will do nothing, because the string is still valid utf8.
If this commenter's older sister, `Aテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテ
テテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテ`, also tries to comment,
the result is very different. This name is a 259-byte string.
`$wpdb->strip_invalid_text_for_column()` will truncate it to 255 bytes,
taking off one character and 1/3 of another. When
`wp_check_invalid_utf8()` gets called, it will truncate the string to zero
bytes out of an abundance of caution, since the string ends with something
that is not valid utf8.
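The difference between the two names can be reproduced outside WordPress. The following is a minimal sketch, not the attached plugin: `sketch_truncate_bytes()` and `sketch_check_utf8()` are hypothetical stand-ins that only model the behavior described above (byte-wise truncation to 255 bytes, then an all-or-nothing UTF-8 check), and the character is written as the three-byte sequence for U+30C6.

```php
<?php
// Hypothetical stand-ins, NOT the WordPress functions themselves; they only
// model the behavior described in this ticket.

// Models the byte-wise truncation to the 255-byte tinytext limit.
function sketch_truncate_bytes( $text, $max_bytes = 255 ) {
    return substr( $text, 0, $max_bytes );
}

// Models the all-or-nothing check: valid UTF-8 passes through unchanged,
// anything else becomes an empty string.
function sketch_check_utf8( $text ) {
    return ( 1 === preg_match( '//u', $text ) ) ? $text : '';
}

$te = "\xE3\x83\x86"; // one three-byte katakana character (U+30C6)

$younger = str_repeat( $te, 86 );       // 86 * 3     = 258 bytes
$older   = 'A' . str_repeat( $te, 86 ); // 1 + 86 * 3 = 259 bytes

$younger_saved = sketch_check_utf8( sketch_truncate_bytes( $younger ) );
$older_saved   = sketch_check_utf8( sketch_truncate_bytes( $older ) );

echo strlen( $younger_saved ), "\n"; // 255: 85 whole characters survive
echo strlen( $older_saved ), "\n";   // 0: two dangling bytes invalidate the whole name
```

The one-byte `A` shifts every later character boundary, so the 255-byte cut lands inside a character instead of between two.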
It's clear that the commenter was not submitting invalid utf8, and the
true goal of `$wpdb->strip_invalid_text_for_column()` was to ensure that
the text would fit in the DB column without auto-truncation by the DB
engine, so the ideal behavior should be that the string is truncated to
the longest possible length that remains valid and fits within the column.
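To make that ideal behavior concrete, here is a rough sketch of a boundary-aware truncation. `sketch_truncate_utf8()` is a hypothetical helper, not an existing WordPress function, and it assumes the input was valid UTF-8 before being cut.

```php
<?php
// Hypothetical helper, not part of WordPress: cut $text to at most $max_bytes
// bytes, then back up past any UTF-8 character the cut landed inside.
// Assumes $text was valid UTF-8 before truncation.
function sketch_truncate_utf8( $text, $max_bytes ) {
    if ( strlen( $text ) <= $max_bytes ) {
        return $text;
    }

    $end = $max_bytes;

    // Continuation bytes have the bit pattern 10xxxxxx. If the first byte to
    // be dropped is one, the cut is mid-character, so step back until the
    // first dropped byte is that character's lead byte.
    while ( $end > 0 && ( ord( $text[ $end ] ) & 0xC0 ) === 0x80 ) {
        $end--;
    }

    return substr( $text, 0, $end );
}

$te = "\xE3\x83\x86"; // one three-byte katakana character (U+30C6)

echo strlen( sketch_truncate_utf8( str_repeat( $te, 86 ), 255 ) ), "\n";       // 255
echo strlen( sketch_truncate_utf8( 'A' . str_repeat( $te, 86 ), 255 ) ), "\n"; // 253
```

For the 259-byte name this keeps `A` plus 84 whole characters (253 bytes) rather than discarding the name entirely.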
One way to get around this data loss would be a wrapper around
`wp_check_invalid_utf8()`. If `wp_check_invalid_utf8()` fails, chop a
single byte off the end of the string and check it again, up to the point
where you have checked the string without the last five bytes (as I
believe that the longest a single character can be is six bytes, although
I'm not positive about that and I think anything longer than four bytes is
mostly theoretical). Or, fix `$wpdb->strip_invalid_text_for_column()` so
that it doesn't truncate in the middle of a multibyte character.
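The wrapper idea could look roughly like the sketch below. Both function names are hypothetical, and `sketch_is_valid_utf8()` is only a stand-in for the validity check, not the WordPress implementation. The default of five trimmed bytes follows the proposal above; since UTF-8 as defined in RFC 3629 uses at most four bytes per character, trimming up to three bytes should already be enough for a string that was valid before it was cut.

```php
<?php
// Stand-in for the validity check (an assumption, not the WordPress code).
function sketch_is_valid_utf8( $text ) {
    return 1 === preg_match( '//u', $text );
}

// Hypothetical wrapper: if the text is invalid, drop one trailing byte at a
// time and re-check, giving up after $max_trim bytes have been removed.
function sketch_salvage_utf8( $text, $max_trim = 5 ) {
    $length = strlen( $text );

    for ( $trim = 0; $trim <= $max_trim && $trim < $length; $trim++ ) {
        $candidate = ( 0 === $trim ) ? $text : substr( $text, 0, $length - $trim );
        if ( sketch_is_valid_utf8( $candidate ) ) {
            return $candidate;
        }
    }

    // Still invalid after trimming: fall back to the cautious empty string.
    return '';
}

$te        = "\xE3\x83\x86"; // one three-byte katakana character (U+30C6)
$truncated = substr( 'A' . str_repeat( $te, 86 ), 0, 255 ); // ends mid-character

echo strlen( sketch_salvage_utf8( $truncated ) ), "\n";      // 253
echo strlen( sketch_salvage_utf8( "\xFFabcdefgh" ) ), "\n";  // 0
```

Text that is invalid anywhere other than its last few bytes is still rejected, so the cautious behavior is preserved for genuinely malformed input.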
Configuration details: Tested in both WordPress 4.4.2 and trunk
(4.5-RC1-37153); PHP 5.2.17
I noticed this issue with regard to commenter names, so here's the
structure of my comments DB table (created in 2006, FWIW):
{{{
CREATE TABLE `wp_comments` (
  `comment_ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `comment_post_ID` bigint(20) unsigned NOT NULL DEFAULT '0',
  `comment_author` tinytext NOT NULL,
  `comment_author_email` varchar(100) NOT NULL DEFAULT '',
  `comment_author_url` varchar(200) NOT NULL DEFAULT '',
  `comment_author_IP` varchar(100) NOT NULL DEFAULT '',
  `comment_date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `comment_date_gmt` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `comment_content` text NOT NULL,
  `comment_karma` int(11) NOT NULL DEFAULT '0',
  `comment_approved` varchar(20) NOT NULL DEFAULT '1',
  `comment_agent` varchar(255) NOT NULL DEFAULT '',
  `comment_type` varchar(20) NOT NULL DEFAULT '',
  `comment_parent` bigint(20) unsigned NOT NULL DEFAULT '0',
  `user_id` bigint(20) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`comment_ID`),
  KEY `comment_post_ID` (`comment_post_ID`),
  KEY `comment_approved_date_gmt` (`comment_approved`,`comment_date_gmt`),
  KEY `comment_date_gmt` (`comment_date_gmt`),
  KEY `comment_parent` (`comment_parent`),
  KEY `comment_author_email` (`comment_author_email`(10))
) ENGINE=MyISAM AUTO_INCREMENT=2130254 DEFAULT CHARSET=latin1;
}}}
In case the strings I used as example commenter names above get mangled,
here are their base64 encodings:
commenter1: string(344)
"776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D"
commenter2: string(348)
"Qe++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++gw=="
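To check that the encodings arrived intact, they can be decoded and measured. The sketch below rebuilds both strings from their repeating four-character units rather than pasting them in full; the repeat counts (86 and 85) are an assumption derived from the reported lengths of 344 and 348 characters.

```php
<?php
// Rebuild the two encodings from their repeating units. The repeat counts
// are an assumption based on the reported lengths (344 and 348 characters).
$commenter1_b64 = str_repeat( '776D', 86 );
$commenter2_b64 = 'Qe++' . str_repeat( 'g+++', 85 ) . 'gw==';

$commenter1 = base64_decode( $commenter1_b64 );
$commenter2 = base64_decode( $commenter2_b64 );

echo strlen( $commenter1_b64 ), ' -> ', strlen( $commenter1 ), " bytes\n"; // 344 -> 258 bytes
echo strlen( $commenter2_b64 ), ' -> ', strlen( $commenter2 ), " bytes\n"; // 348 -> 259 bytes
```

If the decoded lengths come out as 258 and 259 bytes, the names match the ones described in this ticket.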
I'm attaching a POC plugin that manually walks through how the commenter
name is handled during comment submission. It covers only the path taken
when the first attempt to save the comment fails and
`$wpdb->strip_invalid_text_for_column()` is then called.
--
Ticket URL: <https://core.trac.wordpress.org/ticket/36393>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform