[wp-trac] [WordPress Trac] #11738: sanitize_text_field() issue with UTF-8 characters

Thu Jan 7 01:28:24 UTC 2010

#11738: sanitize_text_field() issue with UTF-8 characters
--------------------------+-------------------------------------------------
 Reporter:  hakre         |       Owner:  hakre      
     Type:  defect (bug)  |      Status:  new        
 Priority:  normal        |   Milestone:  3.0        
Component:  Charset       |     Version:  2.9.1      
 Severity:  normal        |    Keywords:  needs-patch
--------------------------+-------------------------------------------------
Changes (by azaozz):

  * severity:  major => normal
  * milestone:  2.9.2 => 3.0

Comment:

 Replying to [comment:5 hakre]:
 > Setting a static and/or global does not help since on each function call
 the input might have a different encoding.

 I don't think so. This filters text coming either from the browser
 (wp-admin) or the db. In both cases we set the encoding according to
 `get_option('blog_charset')`. If the encoding doesn't match, that would
 mean the text is not coming from a "proper" place, and it would probably
 fail the wp_check_invalid_utf8() test, which is run on the same string.

 > We have functions that work independently of PHP extensions,
 like ''seems_utf8()'' for example. In another patch I offer a safe
 fallback implementation as ''is_valid_utf8()'' that does the job in any
 case, even if the preg functions do not support the u-modifier.

 So in some functions we would insist on using PCRE UTF-8, while in others
 we would bypass it and use a fallback?

 > In the other ticket there is the test case this function needs to cope
 with, those Russian letters in UTF-8. Prior to the commit of the last
 patch that was the only thing "tested" against. No further review of the
 patch nor further tests.

 No, I've tested it with several UTF-8 locales, mostly Asian languages.
 Since it was a regression, the choices were either to fix it reliably or
 to remove it. Reported cases: #11528, #11669, #11619.

 Another possible inconsistency: it seems that "\s" can match slightly
 different characters on different systems or versions of PCRE.

 Given that, forcing \s with the "u" modifier doesn't look right. If the
 blog charset is UTF-8 we can detect whether PCRE supports it (like in
 wp_check_invalid_utf8) and use it, but we will need a fallback that
 matches specific characters, as the current code does.
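
 A minimal sketch of the detect-then-fall-back idea described above. The
 function names and the `[\r\n\t ]` character class are assumptions made
 for illustration, not the code in core; the detection call mirrors the
 kind of check wp_check_invalid_utf8() performs.

```php
<?php
// Hypothetical helpers for illustration only; not WordPress core functions.

/**
 * Report whether this PCRE build accepts the "u" (UTF-8) modifier.
 * The @ hides the warning that a build without UTF-8 support is
 * expected to raise; in that case preg_match() does not return 1.
 */
function example_pcre_utf8_supported() {
	return (bool) @preg_match( '/^./u', 'a' );
}

/**
 * Collapse runs of whitespace into a single space.
 *
 * Uses \s with the "u" modifier only when the charset is UTF-8 and PCRE
 * supports it. Otherwise matches an explicit set of ASCII characters
 * byte-wise, which cannot touch the bytes of multibyte UTF-8 sequences.
 */
function example_collapse_whitespace( $text, $charset = 'UTF-8' ) {
	$is_utf8 = in_array( strtolower( $charset ), array( 'utf8', 'utf-8' ), true );

	if ( $is_utf8 && example_pcre_utf8_supported() ) {
		$result = preg_replace( '/\s+/u', ' ', $text );
		// preg_replace() returns null on error, e.g. invalid UTF-8 input.
		if ( null !== $result ) {
			return $result;
		}
	}

	// Fallback: specific characters only, no dependence on \s semantics.
	return preg_replace( '/[\r\n\t ]+/', ' ', $text );
}
```

 Calling `example_collapse_whitespace( $text, get_option('blog_charset') )`
 would be one way to wire this up. Both branches treat ASCII whitespace
 the same way; they may differ on non-ASCII whitespace, which is the
 inconsistency noted above.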

 Having a global to store this and replace the static currently used in
 wp_check_invalid_utf8() seems the proper approach. I don't see why we
 should run the test whether PCRE UTF-8 is available every time one of
 these functions is called.
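
 As a rough sketch of that approach, assuming a hypothetical global named
 `$example_pcre_utf8` and a hypothetical accessor function (neither exists
 in core), the test would run once per request and be shared by every
 function that needs it:

```php
<?php
// Hypothetical global and function names, for illustration only.

/**
 * Return whether PCRE supports the "u" modifier, testing at most once
 * per request and caching the answer in a global.
 */
function example_pcre_utf8_available() {
	global $example_pcre_utf8;

	if ( ! isset( $example_pcre_utf8 ) ) {
		// Runs only on the first call; later calls reuse the stored value.
		$example_pcre_utf8 = (bool) @preg_match( '/^./u', 'a' );
	}

	return $example_pcre_utf8;
}
```

 Unlike a static inside one function, the global can be read by any
 function that needs the result, at the usual cost of another name in the
 global namespace.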

-- 
Ticket URL: <http://core.trac.wordpress.org/ticket/11738#comment:11>
WordPress Trac <http://core.trac.wordpress.org/>
WordPress blogging software