[wp-trac] [WordPress Trac] #29717: wp_check_invalid_utf8 - pcre tricks and failsafes, +mb_convert_encoding, iconv fix, performance
WordPress Trac
noreply at wordpress.org
Sat Sep 20 17:18:15 UTC 2014
#29717: wp_check_invalid_utf8 - pcre tricks and failsafes, +mb_convert_encoding,
iconv fix, performance
-----------------------------------------+-----------------------------
 Reporter:  askapache                    |      Owner:
     Type:  enhancement                  |     Status:  new
 Priority:  normal                       |  Milestone:  Awaiting Review
Component:  Formatting                   |    Version:  trunk
 Severity:  normal                       |   Keywords:
  Focuses:  administration, performance  |
-----------------------------------------+-----------------------------
wp_check_invalid_utf8() is used in core by these 4 functions:
* esc_attr()
* esc_js()
* esc_html()
* sanitize_text_field()
It's the first function to execute in all 4, and it matters most for
sanitize_text_field(), which gets called quite a bit.
Its purpose is to check a string for invalid UTF-8. It uses preg_match()
with the '/u' modifier, which makes PCRE treat both the pattern and the
subject as UTF-8. PCRE automatically validates both, and on invalid UTF-8
it fails with an error code/constant.
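As a minimal illustration of the behaviour described above (not the patch itself): with the '/u' modifier, preg_match() returns false when the subject is not valid UTF-8, and preg_last_error() then reports PREG_BAD_UTF8_ERROR. The helper name is hypothetical, and PCRE is assumed to be built with UTF-8 support.

```php
<?php
// Hypothetical helper: true when PCRE accepts $string as UTF-8.
// Assumes PCRE was built with UTF-8 support so that '/u' is available.
function example_is_valid_utf8( $string ) {
    // An empty pattern matches any subject, so a false return means
    // PCRE rejected the input rather than "no match".
    return false !== @preg_match( '//u', (string) $string );
}

var_dump( example_is_valid_utf8( "\xc3\xb1" ) ); // well-formed 2-octet sequence
var_dump( example_is_valid_utf8( "\xc3\x28" ) ); // malformed 2-octet sequence

// After a failed match, the error code says why it failed.
@preg_match( '//u', "\xc3\x28" );
var_dump( PREG_BAD_UTF8_ERROR === preg_last_error() );
```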
The changes here: Normally PCRE is compiled with UTF-8 support, but it can
also be built without it, or with the support present but not enabled. If
UTF-8 support is compiled in and enabled, the '/u' modifier for preg_match()
is available, which turns on the automatic UTF-8 validation.
For older distributions, or those where '/u' is unavailable, there is a
trick that enables the same functionality '/u' provides. Note that the
excerpt below still requires PCRE itself to be built with UTF-8 support.
http://www.pcre.org/pcre.txt
In order process UTF-8 strings, you must build PCRE to include UTF-8
support in the code, and, in addition, you must call pcre_compile()
with the PCRE_UTF8 option flag, or the pattern must start with the
sequence (*UTF8). When either of these is the case, both the pattern
and any subject strings that are matched against it are treated as
UTF-8 strings instead of strings of 1-byte characters.
So the first change to this function was to allow a fallback to that
pattern option trick in case '/u' isn't supported. The checks run in this
order:
1. `@preg_match( '//u', '' ) !== false`
2. `@preg_match( '/(*UTF8)/', '' ) !== false`
3. Fall back to a regex that doesn't require UTF-8 support. Instead of
relying on PCRE's UTF-8 validation, it searches for invalid byte sequences
itself.
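The three steps above can be sketched as a one-time capability probe. This is only an illustration under assumptions: the function name and the mode labels are hypothetical, and the real patch may structure the detection differently.

```php
<?php
// Hypothetical helper: decide once per request which validation strategy
// is available, then reuse the answer through a static variable.
function example_utf8_check_mode() {
    static $mode = null;

    if ( null !== $mode ) {
        return $mode; // already probed during this request
    }

    if ( false !== @preg_match( '//u', '' ) ) {
        $mode = 'modifier'; // 1. the '/u' modifier compiles
    } elseif ( false !== @preg_match( '/(*UTF8)/', '' ) ) {
        $mode = 'pattern';  // 2. the (*UTF8) start-of-pattern option compiles
    } else {
        $mode = 'bytes';    // 3. neither compiles: use a byte-level regex
    }

    return $mode;
}

echo example_utf8_check_mode(), "\n";
```

Probing with an empty subject keeps the test cheap, and the static variable means the two trial compilations happen at most once however often the check is called.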
I also wanted better performance, especially given its use in those 4
core functions, which I use often. I benchmarked it pretty thoroughly to
try to gain more speed. This patch is about 10-20% faster.
Many gains came from refactoring the logic and control structures,
chaining boolean conditions within if statements, and using the static
variables to the fullest. This is especially crucial since this function
gets called repeatedly. I also gained some cycles by replacing an
in_array() check with a `stripos`.
One of the bigger gains came from replacing the `strlen( $string ) == 0`
check, which ran on every call, with the isset() test shown below. Since
the $string variable has already been cast to a string, that should always
work and keeps things a little cheaper.
{{{
$string = (string) $string;
// if string length is 0 (faster than strlen) return empty
if ( ! isset( $string[0] ) )
return '';
}}}
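A small sketch of why the isset() test is a safe stand-in for the strlen() comparison once the value has been cast to a string. The helper name is hypothetical. Note that "0" counts as non-empty here, unlike with empty().

```php
<?php
// Hypothetical helper mirroring the check above.
function example_is_empty_string( $string ) {
    $string = (string) $string;
    // Offset 0 exists only when the string has at least one byte.
    return ! isset( $string[0] );
}

var_dump( example_is_empty_string( '' ) );   // no bytes at all
var_dump( example_is_empty_string( '0' ) );  // one byte, so not empty
var_dump( example_is_empty_string( null ) ); // casts to ''
```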
The final change was to the 2nd parameter, $strip, which if true is
supposed to strip the invalid UTF-8 out of the string and return what is
valid. Nowhere in core is that parameter being used (yet), which explains
the deprecated-looking iconv call. I also added a fallback to
mb_convert_encoding in case iconv is missing.
{{{
// try to use iconv if exists
if ( function_exists( 'iconv' ) )
return @iconv( 'utf-8', 'utf-8//ignore', $string );
// otherwise try to use mb_convert_encoding, setting the
// substitute_character to none to mimic strip
if ( function_exists( 'mb_convert_encoding' ) ) {
@ini_set( 'mbstring.substitute_character', 'none' );
return @mb_convert_encoding( $string, 'utf-8', 'utf-8' );
}
}}}
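A hedged sketch of the mbstring branch on its own. It swaps in mb_substitute_character() for the ini_set() call above so the previous setting can be restored afterwards. That is my substitution for illustration, not what the patch does, and the helper name is hypothetical.

```php
<?php
// Hypothetical helper: drop byte sequences that are not valid UTF-8.
function example_strip_invalid_utf8( $string ) {
    $string = (string) $string;

    if ( ! function_exists( 'mb_convert_encoding' ) ) {
        return $string; // this sketch has no other fallback
    }

    // Remember the current substitute character, then ask mbstring to
    // emit nothing for bytes it cannot decode.
    $previous = mb_substitute_character();
    mb_substitute_character( 'none' );

    // Converting UTF-8 to UTF-8 forces a decode/encode round trip,
    // which is where the undecodable bytes get dropped.
    $clean = mb_convert_encoding( $string, 'UTF-8', 'UTF-8' );

    mb_substitute_character( $previous );

    return $clean;
}

$clean = example_strip_invalid_utf8( "this is an invalid char '\xe9' here" );
var_dump( false !== @preg_match( '//u', $clean ) ); // result should validate
```

Restoring the substitute character matters because it is process-wide state. Leaving it at 'none' would silently change how later conversions in the same request behave.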
Here are some of the test strings I used. I also used the UTF-8 test file
at http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt. I did
testing on WordPress 4.0 using PHP 5.6, 5.4, and 5.3. I verified the output
and the strip feature as well. For all tests I had PHP error_reporting
set to the max:
{{{
ini_set( 'error_reporting', 2147483647 );
}}}
{{{
$valid_utf = array(
    "\xc3\xb1", // Valid 2 Octet Sequence
    "\xe2\x82\xa1", // Valid 3 Octet Sequence
    "\xf0\x90\x8c\xbc", // Valid 4 Octet Sequence
    "\xf8\xa1\xa1\xa1\xa1", // Valid 5 Octet Sequence (but not Unicode!)
    "\xfc\xa1\xa1\xa1\xa1\xa1", // Valid 6 Octet Sequence (but not Unicode!)
    "Iñtërnâtiônàlizætiøn\xf0\x90\x8c\xbcIñtërnâtiônàlizætiøn", // valid four octet id
    'Iñtërnâtiônàlizætiøn', // valid UTF-8 string
    "\xc3\xb1", // valid two octet id
    "Iñtërnâtiônàlizætiøn\xe2\x82\xa1Iñtërnâtiônàlizætiøn", // valid three octet id
);
$invalid_utf = array(
    "\xc3\x28", // Invalid 2 Octet Sequence
    "\xa0\xa1", // Invalid Sequence Identifier
    "\xe2\x28\xa1", // Invalid 3 Octet Sequence (in 2nd Octet)
    "\xe2\x82\x28", // Invalid 3 Octet Sequence (in 3rd Octet)
    "\xf0\x28\x8c\xbc", // Invalid 4 Octet Sequence (in 2nd Octet)
    "\xf0\x90\x28\xbc", // Invalid 4 Octet Sequence (in 3rd Octet)
    "\xf0\x28\x8c\x28", // Invalid 4 Octet Sequence (in 4th Octet)
    // Malformed because 0x22 is not a valid second trailing byte following
    // the leading byte 0xE3. http://www.unicode.org/reports/tr36/
    chr(0xE3) . chr(0x80) . chr(0x22),
    chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80), // Invalid UTF-8, overlong 5 byte encoding.
    chr(0xD0) . chr(0x01), // High code-point without trailing characters.
    chr(0xC0) . chr(0x80), // Overlong encoding of code point 0
    chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80), // Overlong encoding of 5 byte encoding
    chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80), // Overlong encoding of 6 byte encoding
    chr(0xD0) . chr(0x01), // High code-point without trailing characters
    "Iñtërnâtiôn\xe9àlizætiøn", // invalid UTF-8 string
    "Iñtërnâtiônàlizætiøn\xfc\xa1\xa1\xa1\xa1\xa1Iñtërnâtiônàlizætiøn", // invalid six octet sequence
    "Iñtërnâtiônàlizætiøn\xf0\x28\x8c\xbcIñtërnâtiônàlizætiøn", // invalid four octet sequence
    "Iñtërnâtiônàlizætiøn \xc3\x28 Iñtërnâtiônàlizætiøn", // invalid two octet sequence
    "this is an invalid char '\xe9' here", // invalid ASCII string
    "Iñtërnâtiônàlizætiøn\xa0\xa1Iñtërnâtiônàlizætiøn", // invalid id between two and three
    "Iñtërnâtiônàlizætiøn\xf8\xa1\xa1\xa1\xa1Iñtërnâtiônàlizætiøn", // invalid five octet sequence
    "Iñtërnâtiônàlizætiøn\xe2\x82\x28Iñtërnâtiônàlizætiøn", // invalid three octet sequence third
    "Iñtërnâtiônàlizætiøn\xe2\x28\xa1Iñtërnâtiônàlizætiøn", // invalid three octet sequence second
);
}}}
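One way to run lists like these through a checker is a small harness that reports every string whose result differs from the expectation. This is a sketch, not the test code used for the ticket: both function names are hypothetical, the checker assumes '/u' is available, and the sample arrays are a short subset of the lists above.

```php
<?php
// Hypothetical harness: returns the strings (hex-encoded) whose result
// differs from $expected. $checker is any callable returning true for
// valid UTF-8.
function example_utf8_mismatches( $checker, array $strings, $expected ) {
    $mismatches = array();
    foreach ( $strings as $string ) {
        if ( (bool) call_user_func( $checker, $string ) !== $expected ) {
            // bin2hex() keeps the report printable even for broken bytes.
            $mismatches[] = bin2hex( $string );
        }
    }
    return $mismatches;
}

// Checker assumed here: PCRE validation through the '/u' modifier.
function example_pcre_checker( $string ) {
    return false !== @preg_match( '//u', $string );
}

$valid_sample   = array( "\xc3\xb1", "\xe2\x82\xa1", "\xf0\x90\x8c\xbc" );
$invalid_sample = array( "\xc3\x28", "\xa0\xa1", "\xe2\x28\xa1" );

print_r( example_utf8_mismatches( 'example_pcre_checker', $valid_sample, true ) );
print_r( example_utf8_mismatches( 'example_pcre_checker', $invalid_sample, false ) );
```

The 5- and 6-octet entries in $valid_utf deserve attention when run this way. The RFC 3629 grammar quoted in the notes below does not allow them, so a checker that follows that grammar will report them as mismatches.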
----
Notes and more info:
{{{
In order process UTF-8 strings, you must build PCRE to include UTF-8
support in the code, and, in addition, you must call pcre_compile()
with the PCRE_UTF8 option flag, or the pattern must start with the
sequence (*UTF8). When either of these is the case, both the pattern
and any subject strings that are matched against it are treated as
UTF-8 strings instead of strings of 1-byte characters.

UTF-8 was devised in September 1992 by Ken Thompson, guided by design
criteria specified by Rob Pike, with the objective of defining a UCS
transformation format usable in the Plan9 operating system in a non-
disruptive manner.
Char. number range | UTF-8 octet sequence
(hexadecimal) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
A UTF-8 string is a sequence of octets representing a sequence of UCS
characters. An octet sequence is valid UTF-8 only if it matches the
following syntax, which is derived from the rules for encoding UTF-8
and is expressed in the ABNF of [RFC2234].
UTF8-octets = *( UTF8-char )
UTF8-char = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1 = %x00-7F
UTF8-2 = %xC2-DF UTF8-tail
UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
%xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
%xF4 %x80-8F 2( UTF8-tail )
UTF8-tail = %x80-BF
}}}
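The ABNF above translates almost mechanically into a byte-oriented regex that needs no UTF-8 support from PCRE. The following is a sketch of that translation with a hypothetical function name. It is not necessarily the fallback regex used in the patch.

```php
<?php
// Hypothetical helper: validate UTF-8 using only byte ranges, following
// the RFC 3629 grammar one alternative per ABNF branch.
function example_is_valid_utf8_bytes( $string ) {
    $pattern = '/\A(?:'
        . '[\x00-\x7F]'                     // UTF8-1
        . '|[\xC2-\xDF][\x80-\xBF]'         // UTF8-2
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'     // UTF8-3, no overlongs
        . '|[\xE1-\xEC][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'     // UTF8-3, no surrogates
        . '|[\xEE-\xEF][\x80-\xBF]{2}'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'  // UTF8-4, no overlongs
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}'  // UTF8-4, up to U+10FFFF
        . ')*+\z/';

    // No '/u' modifier: the subject is matched as raw bytes. The
    // possessive *+ avoids pointless backtracking. Very long inputs may
    // still hit PCRE limits, in which case this returns false.
    return 1 === preg_match( $pattern, (string) $string );
}

var_dump( example_is_valid_utf8_bytes( "\xe2\x82\xa1" ) ); // 3-octet sequence
var_dump( example_is_valid_utf8_bytes( "\xc0\x80" ) );     // overlong NUL
```

Because 0xC0, 0xC1 and 0xF5-0xFF never appear as lead bytes in the grammar, overlong 2-byte forms and the old 5- and 6-octet sequences are rejected without any special casing.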
* http://www.pcre.org/pcre.txt
* http://us1.php.net/manual/en/pcre.constants.php
* http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
* http://en.wikipedia.org/wiki/Unicode
* http://www.unicode.org/faq/utf_bom.html
* http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
* http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf
* http://www.unicode.org/reports/tr36/
* http://tools.ietf.org/rfc/rfc3629.txt
Related Tickets:
* https://core.trac.wordpress.org/ticket/11175
* https://core.trac.wordpress.org/ticket/28786
--
Ticket URL: <https://core.trac.wordpress.org/ticket/29717>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform