[wp-trac] [WordPress Trac] #11738: sanitize_text_field() issue with UTF-8 characters
WordPress Trac
wp-trac at lists.automattic.com
Thu Jan 7 01:28:24 UTC 2010
#11738: sanitize_text_field() issue with UTF-8 characters
--------------------------+-------------------------------------------------
Reporter: hakre | Owner: hakre
Type: defect (bug) | Status: new
Priority: normal | Milestone: 3.0
Component: Charset | Version: 2.9.1
Severity: normal | Keywords: needs-patch
--------------------------+-------------------------------------------------
Changes (by azaozz):
* severity: major => normal
* milestone: 2.9.2 => 3.0
Comment:
Replying to [comment:5 hakre]:
> Setting a static and/or global does not help since on each function call
> the input might have a different encoding.
I don't think so. This function filters text coming either from the browser
(wp-admin) or from the database. In both cases we set the encoding according
to `get_option('blog_charset')`. If the encoding doesn't match, the text is
not coming from a "proper" place and would probably fail the
wp_check_invalid_utf8() test, which is run on the same string.
> We have functions that work independently of PHP extensions, like
> ''seems_utf8()'' for example. In another patch I offer a fallback-safe
> implementation, ''is_valid_utf8()'', that does the job in any case, even
> if the preg functions do not support the u-modifier.
So in some functions we would insist on using PCRE UTF-8, while in others we
would bypass it and use a fallback?
> In the other ticket there is the test case this function needs to cope
> with, those Russian letters in UTF-8. Prior to the commit of the last
> patch, that was the only thing "tested" against. There was no further
> review of the patch and no further tests.
No, I've tested it with several UTF-8 locales, mostly Asian languages, and
since it was a regression the choices were either to fix it reliably or to
remove it. Reported cases: #11528, #11669, #11619.
Another possible inconsistency: it seems that "\s" can match slightly
different characters on different systems or versions of PCRE.
Given that, forcing \s with the "u" modifier doesn't look right. If the
blog charset is UTF-8, we can detect whether PCRE supports it (as in
wp_check_invalid_utf8()) and use it, but we will need a fallback that
matches specific characters, as the code currently does.
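
A rough sketch of what I mean, not a patch: the function names
(example_pcre_has_utf8(), example_collapse_whitespace()) and the exact
character list in the fallback are assumptions for illustration only. It
tries the "u" modifier when the charset is UTF-8 and PCRE accepts it, and
otherwise matches an explicit list of ASCII whitespace characters.

```php
<?php
// Hypothetical helper (not in core): does this PCRE build accept the "u" modifier?
function example_pcre_has_utf8() {
    // preg_match() returns false, with a warning we suppress, if "u" is unsupported.
    return (bool) @preg_match( '/^./u', 'a' );
}

// Hypothetical helper: collapse runs of whitespace into a single space.
function example_collapse_whitespace( $text, $charset = 'UTF-8' ) {
    $is_utf8 = in_array( strtoupper( $charset ), array( 'UTF-8', 'UTF8' ), true );

    if ( $is_utf8 && example_pcre_has_utf8() ) {
        // UTF-8 aware matching; returns null if $text is not valid UTF-8.
        $result = preg_replace( '/\s+/u', ' ', $text );
        if ( null !== $result ) {
            return trim( $result );
        }
    }

    // Fallback: match specific characters only, so the result does not
    // depend on how a given system or PCRE version defines "\s".
    return trim( preg_replace( '/[\r\n\t ]+/', ' ', $text ) );
}
```

Called as example_collapse_whitespace( $str, get_option('blog_charset') ),
both branches should give the same result for plain ASCII whitespace.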
Having a global to store this, replacing the static currently used in
wp_check_invalid_utf8(), seems the proper approach. I don't see why we
should test whether PCRE UTF-8 is available every time one of these
functions is called.
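
Something along these lines, as a minimal sketch only: the global name
$example_pcre_utf8 and the function name are made up for illustration and
are not what core uses. The probe runs once per request and every later
call reuses the stored result.

```php
<?php
// Hypothetical shared check (names are assumptions, not core API).
function example_pcre_utf8_available() {
    global $example_pcre_utf8;

    // Probe PCRE only on the first call; later calls reuse the stored value.
    if ( ! isset( $example_pcre_utf8 ) ) {
        // preg_match() returns false if the "u" modifier is unsupported.
        $example_pcre_utf8 = (bool) @preg_match( '/^./u', 'a' );
    }

    return $example_pcre_utf8;
}
```

Unlike a static inside wp_check_invalid_utf8(), a global like this could be
read by any of the sanitizing functions without repeating the test.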
--
Ticket URL: <http://core.trac.wordpress.org/ticket/11738#comment:11>
WordPress Trac <http://core.trac.wordpress.org/>
WordPress blogging software