[wp-trac] [WordPress Trac] #11738: sanitize_text_field() issue with UTF-8 characters

WordPress Trac wp-trac at lists.automattic.com
Sun Jan 10 03:50:42 UTC 2010


#11738: sanitize_text_field() issue with UTF-8 characters
--------------------------+-------------------------------------------------
 Reporter:  hakre         |       Owner:  hakre      
     Type:  defect (bug)  |      Status:  new        
 Priority:  normal        |   Milestone:  3.0        
Component:  Charset       |     Version:  2.9.1      
 Severity:  normal        |    Keywords:  needs-patch
--------------------------+-------------------------------------------------

Comment(by hakre):

 Replying to [comment:8 azaozz]:
 > It seems it should be but as far as I remember we had some issues with
 it in the past.

 Please provide references to them; this might be related to PHP versions <
 4.2.

 > Also PCRE (the library) can be build without UTF-8 support, in fact it
 seems it's disabled by default:
 >
 >   "If you want to make use of the support for UTF-8 Unicode character
 strings in
 >   PCRE, you must add --enable-utf8 to the "configure" command. (...)
 >
 > http://www.pcre.org/readme.txt

 PHP's preg_ functions are not the original PCRE. Please refer to PHP and
 its documentation, not to Perl and its documentation, regarding
 functionality incl. modifiers, especially the /u modifier. There are
 differences between regular expressions in Perl and the PCRE bundled with
 PHP.

 ---

 Replying to [comment:11 azaozz]:
 > Replying to [comment:5 hakre]:
 > > We have functions that are working independently from php extenstions
 like ''seems_utf8()'' for example. In another patch I offer a fallback
 save implementation as ''is_valid_utf8()'' that does the job in any case
 even if the preg functions do not support any u-modifier.
 >
 > So in some functions we would insist on using PCRE UTF-8 while in other
 we would bypass it and use a fallback?

 I just wanted to give an example showing that it's possible to write a
 regex that handles UTF-8 validation properly w/o the /u modifier (even for
 shift-space and the like). For the case here, my main thought was that the
 /u modifier makes the pattern simpler to read. It can be done
 ''properly'' w/o the /u modifier as well.
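
 To illustrate the point, here is a minimal sketch of a byte-level pattern
 that validates UTF-8 without the /u modifier. The function name
 sketch_is_valid_utf8() is hypothetical and this is not the
 ''is_valid_utf8()'' from the other patch; the byte ranges follow the
 well-formed sequences defined for UTF-8 (no overlongs, no surrogates,
 nothing above U+10FFFF).

```php
<?php
// Hypothetical helper for illustration only; not taken from any patch.
// It works on raw bytes, so the /u modifier is not needed.
function sketch_is_valid_utf8( $str ) {
    $pattern = '/\A(?:
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2 bytes, no overlongs
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3 bytes, no overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 bytes, regular
        | \xED[\x80-\x9F][\x80-\xBF]         # 3 bytes, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 bytes, planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4 bytes, planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 bytes, plane 16
    )*\z/x';

    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return 1 === preg_match( $pattern, $str );
}
```

 On very long input the repeated group may run into PCRE's backtrack or
 recursion limits, in which case preg_match() returns false and the sketch
 reports the string as invalid; a real implementation would probably have
 to check the input in chunks.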

 > > In the other ticket there is the test-case this function needs to cope
 with, those russian letters in UTF8. Prior to commit of the last patch
 that was the only thing "tested" against. No further review of the patch
 nor further tests.
 >
 > No, I've tested it with several UTF-8 locales, mostly Asian languages
 and since it was a regression the choices were either to fix it reliably
 or remove it. Reported cases: #11528, #11669, #11619.

 That's good to know. Can you please reference / attach those tests so they
 can be used to test an alternative approach to #11528 / [12499]?

 > Another possible inconsistency: seems that "\s" can match slightly
 different characters on different systems or versions of PCRE.

 What \s matches
 [http://www.php.net/manual/en/regexp.reference.backslash.php is described
 here]. \s matches any whitespace character (which includes 0xA0 / 160;
 that's why we have the bug reports in the other tickets).

 > In that terms forcing \s with the "u" modifier doesn't look right. If
 the blog charset is UTF-8 we can detect whether PCRE supports it (like in
 wp_check_invalid_utf8) and use it but will need to have a fallback that
 matches specific characters as it is currently.

 \s with the /u modifier looks pretty right. Whitespace, UTF-8 safe. Done.
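
 As a minimal sketch of what that would look like (the function name
 sketch_collapse_ws_u() is hypothetical, and returning the input unchanged
 on error is an assumption of this example, not existing WordPress
 behaviour):

```php
<?php
// Hypothetical helper for illustration only; not WordPress core code.
function sketch_collapse_ws_u( $str ) {
    // With /u the subject is read as UTF-8 characters, so a 0xA0 byte
    // inside a multibyte sequence is never treated as whitespace.
    $out = preg_replace( '/\s+/u', ' ', $str );

    // preg_replace() returns null on error, e.g. for invalid UTF-8 input.
    // Assumption of this sketch: hand the input back untouched then.
    return null === $out ? $str : $out;
}
```

 With this, the two bytes 0xD0 0xA0 (one Cyrillic letter) stay intact while
 the run of spaces and tabs after them collapses into a single space.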

 Well, with PHP 4.3 the /u modifier is always there (unless PHP was
 compiled --without-pcre-regex, but then the ''ascii 7bit centric "fix"''
 won't work either). But as I've already written, there is no need for /u:
 whitespace can be removed properly without it, as long as valid UTF-8
 multibyte sequences are taken care of.
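
 A minimal sketch of the variant without /u, assuming the input is already
 valid UTF-8 and that the no-break space (U+00A0, bytes 0xC2 0xA0) should
 count as whitespace. The function name sketch_collapse_ws_bytes() is
 hypothetical. Because a UTF-8 lead byte such as 0xC2 never occurs as a
 continuation byte, matching the explicit byte pair cannot cut into another
 character.

```php
<?php
// Hypothetical helper for illustration only; not WordPress core code.
// Assumes $str is valid UTF-8.
function sketch_collapse_ws_bytes( $str ) {
    // Explicit ASCII whitespace bytes plus the UTF-8 encoding of U+00A0.
    // No \s and no /u, so the result does not depend on locale tables
    // or on PCRE being built with UTF-8 support.
    return preg_replace( '/(?:[ \t\n\r\f\x0B]|\xC2\xA0)+/', ' ', $str );
}
```

 Further Unicode space characters could be added to the alternation as
 their explicit byte sequences in the same way.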

 > Having a global to store this and replace the static currently used in
 wp_check_invalid_utf8() seems the proper approach. Don't see why we should
 run the test whether PCRE UTF-8 is available every time one of these
 functions is called.

 That's right, I've been a bit shortsighted here. A variable can save a
 decision for the length of a request, true. The options I have here are
 (no specific order):

  * native PHP code
  * preg_ functions (enabled per default since 4.2.0)
  * mb_ functions (non-default extension)
  * iconv (non-default extension)

 A static per function might do the best job, since depending on the task
 (validating, filtering) the functions to use might have different
 priorities, and as long as this is at an early stage of implementation I
 do not want to mess around with globals that much.
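
 A minimal sketch of the per-function static, assuming a priority order of
 preg_, then mb_, then iconv, then native PHP code for a validation task.
 The function name sketch_utf8_backend() and the order are hypothetical;
 the /u probe is the same kind of test wp_check_invalid_utf8() runs.

```php
<?php
// Hypothetical helper for illustration only; not WordPress core code.
// Decides once which backend to use and keeps the answer in a static,
// so the detection runs a single time per request.
function sketch_utf8_backend() {
    static $backend = null;

    if ( null === $backend ) {
        if ( function_exists( 'preg_match' ) && 1 === @preg_match( '/^./u', 'a' ) ) {
            $backend = 'preg';   // PCRE with working /u modifier
        } elseif ( function_exists( 'mb_check_encoding' ) ) {
            $backend = 'mb';     // mbstring extension
        } elseif ( function_exists( 'iconv' ) ) {
            $backend = 'iconv';  // iconv extension
        } else {
            $backend = 'native'; // plain PHP code as last resort
        }
    }

    return $backend;
}
```

 A filtering function could keep its own static with a different order,
 which is the reason for not sharing one global between them.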

-- 
Ticket URL: <http://core.trac.wordpress.org/ticket/11738#comment:12>
WordPress Trac <http://core.trac.wordpress.org/>
WordPress blogging software

