[wp-trac] [WordPress Trac] #29717: wp_check_invalid_utf8 - pcre tricks and failsafes, +mb_convert_encoding, iconv fix, performance

Sat Sep 20 17:18:15 UTC 2014

#29717: wp_check_invalid_utf8 - pcre tricks and failsafes, +mb_convert_encoding,
iconv fix, performance
-----------------------------------------+-----------------------------
 Reporter:  askapache                    |      Owner:
     Type:  enhancement                  |     Status:  new
 Priority:  normal                       |  Milestone:  Awaiting Review
Component:  Formatting                   |    Version:  trunk
 Severity:  normal                       |   Keywords:
  Focuses:  administration, performance  |
-----------------------------------------+-----------------------------
 wp_check_invalid_utf8() is used in core by these 4 functions:

 * esc_attr()
 * esc_js()
 * esc_html()
 * sanitize_text_field()

 It's the first function to execute in all 4, and sanitize_text_field() in
 particular gets called quite a bit, so it is pretty important.

 Its purpose is to check a string for invalid UTF-8.  It uses preg_match
 with the '/u' modifier, which makes PCRE treat both the pattern and the
 subject as UTF-8.  PCRE then validates both automatically; if either
 contains invalid UTF-8, the match fails with an error code/constant
 (PREG_BAD_UTF8_ERROR from preg_last_error()).
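
 A minimal sketch of that behaviour, for illustration only.  The helper
 name sketch_is_valid_utf8() is hypothetical and is not part of core or of
 this patch; it only relies on preg_match() and preg_last_error(), and
 assumes PCRE was built with UTF-8 support:

```php
<?php
// Hypothetical helper, not part of WordPress core or of the patch.
function sketch_is_valid_utf8( $string ) {
    $string = (string) $string;

    // An empty string is trivially valid.
    if ( '' === $string ) {
        return true;
    }

    // With '/u', PCRE validates the whole subject before matching.
    // On invalid UTF-8, preg_match() returns false instead of 0 or 1.
    return 1 === @preg_match( '/^./us', $string );
}

// Usage: after a failed validation, preg_last_error() reports the reason.
sketch_is_valid_utf8( "\xc3\x28" ); // invalid 2 octet sequence
$was_bad_utf8 = ( PREG_BAD_UTF8_ERROR === preg_last_error() );
```

 The 's' modifier lets '.' match a leading newline, so any non-empty valid
 string matches and only a validation failure produces a non-1 result.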

 The changes here:  Normally PCRE is compiled with UTF-8 support, but it
 can also be compiled without it.  If UTF-8 support is compiled in and
 enabled, the '/u' modifier for preg_match is available, which turns on the
 automatic UTF-8 validation.

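 For older dists, or those where the '/u' modifier is not usable, there is
 a trick to request the same functionality as '/u' provides from inside the
 pattern itself.  Note that, per the PCRE documentation quoted here, PCRE
 must still be built with UTF-8 support for this to work: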

   http://www.pcre.org/pcre.txt
   In  order  process  UTF-8 strings, you must build PCRE to include UTF-8
   support in the code, and, in addition,  you  must  call  pcre_compile()
   with  the  PCRE_UTF8  option  flag,  or the pattern must start with the
   sequence (*UTF8). When either of these is the case,  both  the  pattern
   and  any  subject  strings  that  are matched against it are treated as
   UTF-8 strings instead of strings of 1-byte characters.

 So the first change to this function was to add a fallback to that
 pattern option trick in case '/u' wasn't supported.  The checks are tried
 in this order:

 1. `@preg_match( '//u', '' ) !== false`
 2. `@preg_match( '/(*UTF8)/', '' ) !== false`
 3. Fallback to a regex that doesn't require UTF support: instead of
 relying on PCRE's UTF validation, the pattern itself looks for invalid UTF-8
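
 The order of checks above could be sketched roughly as follows.  The
 function name and the three mode strings are hypothetical and only serve
 the illustration; the two preg_match() probes are the ones listed above,
 and the result is cached in a static variable so the probing runs once
 per request:

```php
<?php
// Hypothetical helper, not the code from the patch.
function sketch_utf8_pcre_mode() {
    static $mode = null;

    if ( null === $mode ) {
        if ( false !== @preg_match( '//u', '' ) ) {
            // 1. The '/u' modifier compiles, so use it.
            $mode = 'modifier';
        } elseif ( false !== @preg_match( '/(*UTF8)/', '' ) ) {
            // 2. The in-pattern option compiles instead.
            $mode = 'pattern_option';
        } else {
            // 3. No UTF support: fall back to a byte-level regex.
            $mode = 'manual';
        }
    }

    return $mode;
}
```

 Because preg_match() returns false when a pattern fails to compile, each
 probe doubles as a feature test.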

 I also wanted it to have better performance, especially due to its use in
 those 4 core functions I use often. I benchmarked it pretty thoroughly to
 try and gain more speed. This patch is about 10-20% faster.

 Many gains were from refactoring the logic and control structures,
 chaining within if statements using bools, and utilizing the static
 variables to the fullest.  This is especially crucial since this function
 gets called repeatedly.  I also gained some cycles by replacing an
 in_array() check with a `stripos`.
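
 To illustrate the idea only: the sketch below is an assumption about what
 such a check could look like, not the code from the patch.  The helper
 name is hypothetical, and it combines a static cache with a
 case-insensitive stripos() test on a charset name:

```php
<?php
// Hypothetical helper, not the code from the patch.  It caches the result
// per charset in a static array so repeated calls skip the string work.
function sketch_is_utf8_charset( $charset ) {
    static $cache = array();

    $charset = (string) $charset;
    if ( ! isset( $cache[ $charset ] ) ) {
        // Case-insensitive prefix test that accepts 'utf8', 'UTF8',
        // 'utf-8' and 'UTF-8'.  Note it is looser than an exact
        // in_array() match: 'utf8mb4' would also pass.
        $cache[ $charset ] = ( 0 === stripos( $charset, 'utf8' )
            || 0 === stripos( $charset, 'utf-8' ) );
    }

    return $cache[ $charset ];
}
```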

 One of the bigger gains came from replacing the `strlen( $string ) == 0`
 check that ran on every call with an `isset( $string[0] )` check.  Since
 the $string variable has already been cast to a string, that should always
 work and keep things a little cheaper.

 {{{
 $string = (string) $string;

 // if string length is 0 (faster than strlen) return empty
 if ( ! isset( $string[0] ) )
         return '';
 }}}
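
 A small sketch of why the isset() form behaves like the strlen() check
 once the value has been cast.  The helper name is hypothetical:

```php
<?php
// Hypothetical helper illustrating the isset() length check.
function sketch_is_empty_string( $value ) {
    $string = (string) $value;

    // Offset 0 exists only when the string has at least one byte.
    return ! isset( $string[0] );
}
```

 Unlike empty(), this does not treat the string '0' as empty, so it matches
 what `strlen( $string ) == 0` reports.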

 The final change was to the 2nd parameter, $strip, which if true is
 supposed to strip the invalid UTF-8 out of the string and return the valid
 remainder.  Nowhere in core is that parameter being used (yet), which
 explains the deprecated-looking iconv call.  I also added a fallback to use
 mb_convert_encoding in case iconv is missing.

 {{{
 // try to use iconv if exists
 if ( function_exists( 'iconv' ) )
         return @iconv( 'utf-8', 'utf-8//ignore', $string );

 // otherwise try to use mb_convert_encoding, setting the
 // substitute_character to none to mimic strip
 if ( function_exists( 'mb_convert_encoding' ) ) {
         @ini_set( 'mbstring.substitute_character', 'none' );
         return @mb_convert_encoding( $string, 'utf-8', 'utf-8' );
 }
 }}}
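
 A slightly more defensive variant of the same idea, as a sketch rather
 than the code from the patch.  The helper name is hypothetical.  It
 assumes that iconv() can return false on invalid input even with
 '//IGNORE' on some builds, and it uses mb_substitute_character() in place
 of ini_set() so the previous setting can be restored:

```php
<?php
// Hypothetical helper, not the code from the patch.
function sketch_strip_invalid_utf8( $string ) {
    $string = (string) $string;

    // Try iconv first, but only accept a string result.
    if ( function_exists( 'iconv' ) ) {
        $out = @iconv( 'UTF-8', 'UTF-8//IGNORE', $string );
        if ( is_string( $out ) ) {
            return $out;
        }
    }

    // Otherwise use mbstring, dropping invalid bytes instead of
    // substituting a character, then restore the previous setting.
    if ( function_exists( 'mb_convert_encoding' ) ) {
        $previous = mb_substitute_character();
        mb_substitute_character( 'none' );
        $out = mb_convert_encoding( $string, 'UTF-8', 'UTF-8' );
        mb_substitute_character( $previous );
        return $out;
    }

    // Neither extension is available: nothing safe to return.
    return '';
}
```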

 Here are some of the test strings I used.  I also used the utf-8-test file
 at http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt.  I did
 testing on 4.0 using php 5.6, 5.4, 5.3, and 5.4.  I verified the output
 and the strip feature as well.  For all tests I had php error_reporting
 set to the max:

 {{{
 ini_set( 'error_reporting', 2147483647 );
 }}}

 {{{
 $valid_utf = array(
         "\xc3\xb1", // Valid 2 Octet Sequence
         "\xe2\x82\xa1", // Valid 3 Octet Sequence
         "\xf0\x90\x8c\xbc", // Valid 4 Octet Sequence
         "\xf8\xa1\xa1\xa1\xa1", // Valid 5 Octet Sequence (but not Unicode!)
         "\xfc\xa1\xa1\xa1\xa1\xa1", // Valid 6 Octet Sequence (but not Unicode!)
         "Iñtërnâtiônàlizætiøn\xf0\x90\x8c\xbcIñtërnâtiônàlizætiøn", // valid four octet id
         'Iñtërnâtiônàlizætiøn', // valid UTF-8 string
         "\xc3\xb1", // valid two octet id
         "Iñtërnâtiônàlizætiøn\xe2\x82\xa1Iñtërnâtiônàlizætiøn", // valid three octet id
 );

 $invalid_utf = array(
         "\xc3\x28", // Invalid 2 Octet Sequence
         "\xa0\xa1", // Invalid Sequence Identifier
         "\xe2\x28\xa1", // Invalid 3 Octet Sequence (in 2nd Octet)
         "\xe2\x82\x28", // Invalid 3 Octet Sequence (in 3rd Octet)
         "\xf0\x28\x8c\xbc", // Invalid 4 Octet Sequence (in 2nd Octet)
         "\xf0\x90\x28\xbc", // Invalid 4 Octet Sequence (in 3rd Octet)
         "\xf0\x28\x8c\x28", // Invalid 4 Octet Sequence (in 4th Octet)
         // Malformed: 0x22 is not a valid second trailing byte following the
         // leading byte 0xE3. http://www.unicode.org/reports/tr36/
         chr(0xE3) . chr(0x80) . chr(0x22),
         chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80), // Invalid UTF-8, overlong 5 byte encoding.
         chr(0xD0) . chr(0x01), // High code-point without trailing characters.
         chr(0xC0) . chr(0x80), // Overlong encoding of code point 0
         chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80), // Overlong encoding of 5 byte encoding
         chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80), // Overlong encoding of 6 byte encoding
         chr(0xD0) . chr(0x01), // High code-point without trailing characters
         "Iñtërnâtiôn\xe9àlizætiøn", // invalid UTF-8 string
         "Iñtërnâtiônàlizætiøn\xfc\xa1\xa1\xa1\xa1\xa1Iñtërnâtiônàlizætiøn", // invalid six octet sequence
         "Iñtërnâtiônàlizætiøn\xf0\x28\x8c\xbcIñtërnâtiônàlizætiøn", // invalid four octet sequence
         "Iñtërnâtiônàlizætiøn \xc3\x28 Iñtërnâtiônàlizætiøn", // invalid two octet sequence
         "this is an invalid char '\xe9' here", // invalid ASCII string
         "Iñtërnâtiônàlizætiøn\xa0\xa1Iñtërnâtiônàlizætiøn", // invalid id between two and three
         "Iñtërnâtiônàlizætiøn\xf8\xa1\xa1\xa1\xa1Iñtërnâtiônàlizætiøn", // invalid five octet sequence
         "Iñtërnâtiônàlizætiøn\xe2\x82\x28Iñtërnâtiônàlizætiøn", // invalid three octet sequence third
         "Iñtërnâtiônàlizætiøn\xe2\x28\xa1Iñtërnâtiônàlizætiøn", // invalid three octet sequence second
 );
 }}}

 ----

 Notes and more info:

 {{{

    UTF-8 was devised in September 1992 by Ken Thompson, guided by design
    criteria specified by Rob Pike, with the objective of defining a UCS
    transformation format usable in the Plan9 operating system in a non-
    disruptive manner.

     Char. number range  |        UTF-8 octet sequence
       (hexadecimal)    |              (binary)
    --------------------+---------------------------------------------
    0000 0000-0000 007F | 0xxxxxxx
    0000 0080-0000 07FF | 110xxxxx 10xxxxxx
    0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
    0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

    A UTF-8 string is a sequence of octets representing a sequence of UCS
    characters.  An octet sequence is valid UTF-8 only if it matches the
    following syntax, which is derived from the rules for encoding UTF-8
    and is expressed in the ABNF of [RFC2234].

    UTF8-octets = *( UTF8-char )
    UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
    UTF8-1      = %x00-7F
    UTF8-2      = %xC2-DF UTF8-tail
    UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
                  %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
    UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
                  %xF4 %x80-8F 2( UTF8-tail )
    UTF8-tail   = %x80-BF

 }}}
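
 The ABNF above translates fairly directly into a byte-oriented pattern,
 which is one way to do the kind of check that needs no UTF support in
 PCRE.  This is a sketch under that assumption, not the regex from the
 patch, and the helper name is hypothetical:

```php
<?php
// Hypothetical helper: validates a byte string against the RFC 3629 ABNF
// without using the '/u' modifier or the (*UTF8) pattern option.
function sketch_is_well_formed_utf8( $string ) {
    return 1 === preg_match(
        '/\A(?:
              [\x00-\x7F]                      # UTF8-1
            | [\xC2-\xDF][\x80-\xBF]           # UTF8-2
            | \xE0[\xA0-\xBF][\x80-\xBF]       # UTF8-3
            | [\xE1-\xEC][\x80-\xBF]{2}
            | \xED[\x80-\x9F][\x80-\xBF]
            | [\xEE-\xEF][\x80-\xBF]{2}
            | \xF0[\x90-\xBF][\x80-\xBF]{2}    # UTF8-4
            | [\xF1-\xF3][\x80-\xBF]{3}
            | \xF4[\x80-\x8F][\x80-\xBF]{2}
        )*+\z/x',
        (string) $string
    );
}
```

 Each alternative corresponds to one branch of the grammar, so overlong
 forms (0xC0, 0xC1), surrogates (0xED 0xA0-0xBF) and the 5 and 6 octet
 forms are all rejected.  The possessive quantifier keeps PCRE from
 backtracking into characters it has already accepted.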

  * http://www.pcre.org/pcre.txt
  * http://us1.php.net/manual/en/pcre.constants.php
  * http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
  * http://en.wikipedia.org/wiki/Unicode
  * http://unicode.org/faq/utf_bom.html
  * http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
  * http://tools.ietf.org/rfc/rfc3629.txt
  * http://www.unicode.org/faq/utf_bom.html
  * http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf
  * http://www.unicode.org/reports/tr36/

 Related Tickets:

  * https://core.trac.wordpress.org/ticket/11175
  * https://core.trac.wordpress.org/ticket/28786

--
Ticket URL: <https://core.trac.wordpress.org/ticket/29717>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform