[wp-trac] [WordPress Trac] #13590: Inserting a 4-byte UTF-8 character truncates data (was: Inserting a tetragram (SMP/Plane 1) character truncates post fields)

WordPress Trac wp-trac at lists.automattic.com
Wed Jan 19 09:32:24 UTC 2011


#13590: Inserting a 4-byte UTF-8 character truncates data
--------------------------+-----------------------
 Reporter:  sardisson     |       Owner:
     Type:  defect (bug)  |      Status:  reopened
 Priority:  normal        |   Milestone:
Component:  Database      |     Version:  3.0.4
 Severity:  normal        |  Resolution:
 Keywords:  utf8          |
--------------------------+-----------------------
Changes (by aercolino):

 * status:  closed => reopened
 * cc: aercolino (added)
 * component:  Charset => Database
 * version:  2.9.2 => 3.0.4
 * keywords:   => utf8
 * resolution:  invalid =>


Comment:

 I've recently developed a class for escaping / unescaping UTF-8
 characters. I've released it as a Zend Framework class
 ([http://framework.zend.com/wiki/display/ZFPROP/Zend_Utf8+-+Andrea+Ercolino
 Zend_Utf8], now in the proposal stage), and as a stand-alone class for a
 WordPress plugin ([http://wordpress.org/extend/plugins/full-utf-8/ Full
 UTF-8]).

 The plugin was meant to fix the issue described in this ticket, which I
 stumbled upon a few days ago while trying to write an
 [http://noteslog.com/post/escaping-and-unescaping-utf-8-characters-in-php/
 article] about [http://www.ietf.org/rfc/rfc4627.txt RFC 4627 (JSON)].
 The plugin works pretty well for post content (and title, excerpt and
 search), but it doesn't cover custom fields. For those I had to write a
 patch that changed 8 different files. In any case, a plugin plus a patch is
 not a clean solution, and it's possible that some data strings still get
 through to the db along an alternative path I couldn't find with my reverse
 engineering.

 So I thought: why not wrap db queries in an escape / unescape pair? This
 way
 * nothing will ever hit the db without being taken care of
 * the patch will be extremely localized
 I wrote the patch, ran the WordPress tests, saw that the issue was solved,
 and all seemed fine to me.
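
 To make the escape / unescape idea concrete, here is a minimal sketch of
 such a pair. It is not the plugin's code: the function names and the
 {U+XXXXX} token format are assumptions made up for this illustration, and
 the real full_utf8_escape() / full_utf8_unescape() may work differently.

```php
<?php
// Hypothetical helpers, NOT the plugin's real full_utf8_escape()/full_utf8_unescape().
// Assumed token format: {U+XXXXX}. Literal text that already looks like a token
// is not protected here; a real scheme would have to handle that collision.

function sketch_escape_4byte($text) {
    // Byte-mode regex: a 4-byte UTF-8 sequence is a lead byte F0..F4
    // followed by three continuation bytes 80..BF.
    return preg_replace_callback('/[\xF0-\xF4][\x80-\xBF]{3}/', function ($m) {
        $cp = ((ord($m[0][0]) & 0x07) << 18)
            | ((ord($m[0][1]) & 0x3F) << 12)
            | ((ord($m[0][2]) & 0x3F) << 6)
            |  (ord($m[0][3]) & 0x3F);
        // Leave overlong or out-of-range sequences untouched.
        if ($cp < 0x10000 || $cp > 0x10FFFF) {
            return $m[0];
        }
        return sprintf('{U+%X}', $cp);
    }, $text);
}

function sketch_unescape_4byte($text) {
    return preg_replace_callback('/\{U\+([0-9A-F]{5,6})\}/', function ($m) {
        $cp = hexdec($m[1]);
        // Only code points outside the BMP are ever produced by the escape step.
        if ($cp < 0x10000 || $cp > 0x10FFFF) {
            return $m[0];
        }
        return chr(0xF0 | ($cp >> 18))
            . chr(0x80 | (($cp >> 12) & 0x3F))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }, $text);
}

// U+1D306 (a tetragram, 4 bytes) next to the euro sign (3 bytes).
$original = "Tetragram: \xF0\x9D\x8C\x86 and euro: \xE2\x82\xAC";
$escaped  = sketch_escape_4byte($original);   // only the tetragram becomes a token
$restored = sketch_unescape_4byte($escaped);  // round-trips to the original bytes
```

 The 3-byte euro sign passes through unchanged, so only characters outside
 the Basic Multilingual Plane are rewritten before they reach the db.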

 Some questions need to be answered:
 1. Does this solution slow WP down too much?
 1. Does this solution ever fail?
 I have no clear answers, only hints.
 1. Escaping and unescaping are cheap operations, but they do examine a
 string char by char. For this use case, I already short-circuit any char
 MySQL can handle by itself (UTF-8 sequences of up to 3 bytes).
 1. Only strings are escaped/unescaped; everything else is short-circuited
 (at least when writing to the db; when reading, everything is a string),
 so I think only a binary string could cause trouble.
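
 As a rough illustration of the short-circuit described in the two hints
 above, a cheap pre-check could look like the sketch below. The helper name
 is hypothetical and this is not taken from the patch; it relies only on the
 fact that in well-formed UTF-8 the bytes F0..F4 occur solely as the lead
 byte of a 4-byte sequence.

```php
<?php
// Hypothetical helper (not from the patch): true only for strings that contain
// a byte in F0..F4, which in well-formed UTF-8 can only start a 4-byte sequence.
function sketch_has_4byte_lead($value) {
    if (!is_string($value)) {
        return false; // integers, null, etc. are skipped outright
    }
    // Byte-mode regex (no /u flag), so this is a plain byte scan.
    return preg_match('/[\xF0-\xF4]/', $value) === 1;
}

$ascii     = sketch_has_4byte_lead('plain ASCII');       // nothing to escape
$euro      = sketch_has_4byte_lead("\xE2\x82\xAC");      // 3-byte char, MySQL copes
$tetragram = sketch_has_4byte_lead("\xF0\x9D\x8C\x86");  // 4-byte char, needs escaping
$number    = sketch_has_4byte_lead(42);                  // not a string
```

 A binary string can contain bytes in that range by chance, so it would pass
 this check without being text at all, which is the trouble case mentioned
 in the second hint.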
 I'd like to know your thoughts, and whether the patch could find its way
 into an upcoming WP release.
 Here are the parts of the patch that change WP; the whole patch (with two
 added files) is attached.


 {{{
 diff -rupN --exclude-from wpdiffexclude.txt wordpress-3.0.4/wp-includes/wp-db.php wp-db-patched/wp-includes/wp-db.php
 --- wordpress-3.0.4/wp-includes/wp-db.php       2010-07-25 08:34:50.000000000 +0200
 +++ wp-db-patched/wp-includes/wp-db.php 2011-01-18 12:51:56.000000000 +0100
 @@ -1108,7 +1108,8 @@ class wpdb {
                         $dbh =& $this->dbh;
                         $this->last_db_used = "other/read";
                 }
 -
 +
 +               full_utf8_escape($query);
                 $this->result = @mysql_query( $query, $dbh );
                 $this->num_queries++;

 @@ -1136,8 +1137,9 @@ class wpdb {
                                 $i++;
                         }
                         $num_rows = 0;
 -                       while ( $row = @mysql_fetch_object( $this->result ) ) {
 -                               $this->last_result[$num_rows] = $row;
 +                       while ( $row = @mysql_fetch_assoc( $this->result ) ) {
 +                           array_walk($row, 'full_utf8_unescape');
 +                               $this->last_result[$num_rows] = (object) $row;
                                 $num_rows++;
                         }

 diff -rupN --exclude-from wpdiffexclude.txt wordpress-3.0.4/wp-settings.php wp-db-patched/wp-settings.php
 --- wordpress-3.0.4/wp-settings.php     2010-05-02 23:18:36.000000000 +0200
 +++ wp-db-patched/wp-settings.php       2011-01-16 20:19:32.000000000 +0100
 @@ -66,6 +66,7 @@ wp_set_lang_dir();
  require( ABSPATH . WPINC . '/compat.php' );
  require( ABSPATH . WPINC . '/functions.php' );
  require( ABSPATH . WPINC . '/classes.php' );
 +require( ABSPATH . WPINC . '/full-utf8.php' );

  // Include the wpdb class, or a db.php database drop-in if present.
  require_wp_db();
 }}}
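
 The second hunk fetches each row as an associative array, walks it with a
 by-reference callback, and casts it back to an object. The sketch below
 shows only that mechanism in isolation. The callback body and the
 {U+1D306} token are hypothetical stand-ins, not the real
 full_utf8_unescape(), and the row is a hand-written array rather than a
 real query result.

```php
<?php
// Stand-in for the patch's 'full_utf8_unescape' callback (hypothetical body):
// it decodes one assumed token, {U+1D306}, to show the by-reference mechanics.
// array_walk() calls it as callback(&$value, $key) for every element.
function sketch_unescape_value(&$value, $key) {
    if (is_string($value)) { // NULL columns stay NULL
        $value = str_replace('{U+1D306}', "\xF0\x9D\x8C\x86", $value);
    }
}

// A row shaped like an associative fetch result: column name => string or NULL.
// The 'note' column is made up for this example.
$row = array(
    'ID'         => '7',
    'post_title' => 'Tetragram {U+1D306}',
    'note'       => null,
);

array_walk($row, 'sketch_unescape_value'); // modifies $row in place
$object = (object) $row;                   // callers still receive an object per row
```

 Because the cast happens after the walk, code that reads
 $object->post_title sees the restored 4-byte character without knowing
 that an escape step exists.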

-- 
Ticket URL: <http://core.trac.wordpress.org/ticket/13590#comment:4>
WordPress Trac <http://core.trac.wordpress.org/>
WordPress blogging software

