[wp-trac] [WordPress Trac] #22692: Quotes Are Messing Up

WordPress Trac noreply at wordpress.org
Sat Nov 9 22:17:27 UTC 2013


#22692: Quotes Are Messing Up
--------------------------+------------------
 Reporter:  miqrogroove   |       Owner:
     Type:  defect (bug)  |      Status:  new
 Priority:  normal        |   Milestone:  3.8
Component:  Formatting    |     Version:  1.2
 Severity:  normal        |  Resolution:
 Keywords:  has-patch     |
--------------------------+------------------

Comment (by azaozz):

 Yeah, now we are getting somewhere...

 PHP 5.4.16 on Windows 7:

 {{{
 var_dump( setlocale( LC_ALL, 0 ) ); // string(1) "C"
 var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(0)

 // On Windows, an empty string sets the locale to the system default.
 var_dump( setlocale( LC_ALL, '' ) ); // string(19) "English_Canada.1252"

 var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(1)
 // bool(false), as "\xA0" on its own is not a complete UTF-8 character.
 var_dump( preg_match( '/^\s$/u', "\xA0" ) ); // bool(false)

 setlocale( LC_ALL, 'C' );
 var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(0)
 var_dump( preg_match( '/^\s$/u', "\xA0" ) ); // bool(false)
 }}}
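
 The `bool(false)` results above signal an error rather than a failed
 match. A minimal sketch of telling the two apart with `preg_last_error()`;
 the helper name `describe_match()` is hypothetical, and the `bad-utf8`
 outcome assumes a PCRE build that validates UTF-8 subjects:

```php
<?php
// Hypothetical helper (not part of PHP or WordPress): reports whether
// preg_match() matched, did not match, or failed with a PCRE error.
function describe_match( $pattern, $subject ) {
    $result = preg_match( $pattern, $subject );

    if ( false === $result ) {
        // false means an error occurred; preg_last_error() says which one.
        return PREG_BAD_UTF8_ERROR === preg_last_error() ? 'bad-utf8' : 'error';
    }

    return 1 === $result ? 'match' : 'no-match';
}

// A lone "\xA0" byte is not valid UTF-8, so the `u` modifier rejects it.
var_dump( describe_match( '/^\s$/u', "\xA0" ) );

// "\xC2\xA0" is the UTF-8 encoding of U+00A0 (no-break space).
var_dump( describe_match( '/^\s$/u', "\xC2\xA0" ) );
```

 Checking for `false` with `===` matters here, because `0` (no match) and
 `false` (error) are both falsy.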

 PHP 5.3.1 on Mac OSX (note: 5.3.1 doesn't set PCRE_UCP with the `u`
 modifier)

 {{{
 var_dump( setlocale( LC_ALL, 0 ) ); // string(1) "C"
 var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(0)

 // Same results with 'en_CA.UTF-8', 'en_CA.ISO8859-1', 'en_CA.ISO8859-15', etc.
 setlocale( LC_ALL, 'en_CA' );
 var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(1)
 var_dump( preg_match( '/^\s$/u', "\xA0" ) ); // int(0)

 setlocale( LC_ALL, 'C' ); // Also with 'en_CA.US-ASCII'
 var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(0)
 var_dump( preg_match( '/^\s$/u', "\xA0" ) ); // int(0)
 }}}

 So when the locale is anything other than `C` or a `US-ASCII` equivalent,
 `\s` without the `u` modifier matches `\xA0`. Also, on multithreaded
 servers like Apache on Windows, setlocale() is "sticky": the locale is kept
 per process, not per thread, so a change made in one request can affect
 others,
 [http://php.net/manual/en/function.setlocale.php#refsect1-function.setlocale-notes more info].
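
 One way to sidestep the locale dependence is to spell out the whitespace
 bytes instead of relying on `\s`. A sketch, assuming only ASCII whitespace
 should count; the function name `is_ascii_space()` is made up for
 illustration and is not part of the patch:

```php
<?php
// Hypothetical helper: true only for a single ASCII whitespace byte.
// A class of literal bytes does not consult PCRE's locale-dependent
// character tables, unlike \s.
function is_ascii_space( $char ) {
    // D (dollar end only) stops $ from matching before a trailing newline.
    return 1 === preg_match( '/^[ \t\n\r\f]$/D', $char );
}

var_dump( is_ascii_space( ' ' ) );    // bool(true)
var_dump( is_ascii_space( "\xA0" ) ); // bool(false)
```

 The class can be widened if vertical tab or other bytes should count as
 whitespace; the point is only that the set is fixed in the pattern.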

 From the PCRE manual:

 {{{
 PCRE  handles  caseless matching, and determines whether characters are
 letters, digits, or whatever, by reference to a set of tables,  indexed
 by  character  value.  When running in UTF-8 mode, this applies only to
 characters with codes less than 128.
 }}}

 It is not mentioned there, but it seems `\s` is also affected by these
 tables.

 {{{
 The internal tables can always be overridden by tables supplied by  the
 application that calls PCRE... External  tables  are  built by calling
 pcre_maketables()
 }}}

 PHP calls pcre_maketables() only if the locale is not 'C':
 http://git.php.net/?p=php-src.git;a=blob;f=ext/pcre/php_pcre.c;h=7d34d9feb15a81b5e80973cf1aaa1c4936543173;hb=refs/heads/master#l392

 So it seems that the unexpected behavior of `\s` is caused by PCRE when a
 locale other than `C` is set.
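
 If that is right, forcing the `C` locale for the duration of a match
 should give the expected result. A rough sketch of that idea;
 `preg_match_c_locale()` is a hypothetical helper, and because the locale
 is shared by the whole process, switching it like this is not safe on
 multithreaded servers:

```php
<?php
// Hypothetical helper: runs preg_match() with LC_CTYPE set to "C",
// then restores whatever locale was active before.
function preg_match_c_locale( $pattern, $subject ) {
    // "0" only queries the current setting; it changes nothing.
    $previous = setlocale( LC_CTYPE, '0' );

    setlocale( LC_CTYPE, 'C' );
    $result = preg_match( $pattern, $subject );

    if ( false !== $previous ) {
        setlocale( LC_CTYPE, $previous );
    }

    return $result;
}

// int(0) is the result reported for the "C" locale in the tests above.
var_dump( preg_match_c_locale( '/^\s$/', "\xA0" ) );
```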

--
Ticket URL: <http://core.trac.wordpress.org/ticket/22692#comment:58>
WordPress Trac <http://core.trac.wordpress.org/>
WordPress blogging software

