[wp-trac] [WordPress Trac] #11738: sanitize_text_field() issue with UTF-8 characters
WordPress Trac
wp-trac at lists.automattic.com
Sun Jan 10 03:50:42 UTC 2010
#11738: sanitize_text_field() issue with UTF-8 characters
--------------------------+-------------------------------------------------
Reporter: hakre | Owner: hakre
Type: defect (bug) | Status: new
Priority: normal | Milestone: 3.0
Component: Charset | Version: 2.9.1
Severity: normal | Keywords: needs-patch
--------------------------+-------------------------------------------------
Comment(by hakre):
Replying to [comment:8 azaozz]:
> It seems it should be but as far as I remember we had some issues with
> it in the past.
Please provide references to them; this might be related to PHP versions
< 4.2.
> Also PCRE (the library) can be built without UTF-8 support, in fact it
> seems it's disabled by default:
>
> "If you want to make use of the support for UTF-8 Unicode character
> strings in PCRE, you must add --enable-utf8 to the "configure"
> command. (...)
>
> http://www.pcre.org/readme.txt
PHP preg_ functions are not the original PCRE. Please refer to PHP and
its documentation, and not to Perl and its documentation, regarding
functionality incl. modifiers, especially the /u modifier. There are
differences between PCRE in Perl and the PCRE bundled with PHP.
---
Replying to [comment:11 azaozz]:
> Replying to [comment:5 hakre]:
> > We have functions that are working independently from PHP extensions
> > like ''seems_utf8()'' for example. In another patch I offer a safe
> > fallback implementation as ''is_valid_utf8()'' that does the job in
> > any case, even if the preg functions do not support any u-modifier.
>
> So in some functions we would insist on using PCRE UTF-8 while in
> others we would bypass it and use a fallback?
I just wanted to give an example showing that it is possible to write a
regex that handles UTF-8 validation properly without the /u modifier
(even for shift-space and the like). For this case I mainly thought that
the /u modifier makes the pattern simpler to read. It can be done
''properly'' without the /u modifier as well.
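
To illustrate, here is a minimal sketch of byte-level UTF-8 validation
that does not rely on the /u modifier. The function name
example_is_valid_utf8_bytes() is hypothetical and this is not the
''is_valid_utf8()'' from the other patch; the byte ranges are assumed
from the UTF-8 definition in RFC 3629.

```php
<?php
// Hypothetical helper for illustration only.
// The pattern runs in byte mode and spells out every well-formed
// UTF-8 sequence, so PCRE needs no UTF-8 support at all.
function example_is_valid_utf8_bytes( $str ) {
	$pattern = '/\A(?:
		  [\x00-\x7F]                        # ASCII
		| [\xC2-\xDF][\x80-\xBF]             # 2 bytes, no overlongs
		| \xE0[\xA0-\xBF][\x80-\xBF]         # 3 bytes, no overlongs
		| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 bytes
		| \xED[\x80-\x9F][\x80-\xBF]         # 3 bytes, no surrogates
		| \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 bytes, no overlongs
		| [\xF1-\xF3][\x80-\xBF]{3}          # 4 bytes
		| \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 bytes, up to U+10FFFF
	)*+\z/x';

	// preg_match() returns 1 on a match, 0 on no match, false on error.
	return 1 === preg_match( $pattern, $str );
}
```

A lone "\xA0" byte is rejected, while "\xD0\xA0" and "\xC2\xA0" are
accepted. Very long inputs may still hit PCRE limits, in which case
this sketch reports the string as invalid.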
> > In the other ticket there is the test-case this function needs to
> > cope with, those Russian letters in UTF-8. Prior to commit of the
> > last patch that was the only thing "tested" against. No further
> > review of the patch nor further tests.
>
> No, I've tested it with several UTF-8 locales, mostly Asian languages
> and since it was a regression the choices were either to fix it
> reliably or remove it. Reported cases: #11528, #11669, #11619.
That's good to know. Can you please reference or attach those tests so
they can be used to test an alternative approach to #11528 / [12499]?
> Another possible inconsistency: seems that "\s" can match slightly
> different characters on different systems or versions of PCRE.
What \s matches
[http://www.php.net/manual/en/regexp.reference.backslash.php is described
here]. \s matches any whitespace character (which includes xA0 / 160;
that's why we have the bug reports in the other tickets).
> In those terms forcing \s with the "u" modifier doesn't look right. If
> the blog charset is UTF-8 we can detect whether PCRE supports it (like
> in wp_check_invalid_utf8) and use it but will need to have a fallback
> that matches specific characters as it is currently.
\s with the /u modifier looks pretty right to me: whitespace handling
that is UTF-8 safe. Done. As of PHP 4.3 the /u modifier is always there
(unless PHP was compiled --without-pcre-regex, but then the ''ASCII 7-bit
centric "fix"'' won't work either). But as I've already written, there is
no need for /u: whitespace can be removed properly without it, as long as
valid UTF-8 multibyte sequences are taken care of.
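
A small sketch of the difference, using the explicit no-break space
(U+00A0) instead of \s so that the result does not depend on the locale
or the PCRE version. The sample string and variable names are made up
for illustration; the /u variant assumes PCRE was built with UTF-8
support.

```php
<?php
// "\xD0\xA0" is the Cyrillic letter U+0420; its second byte is 0xA0.
// "\xC2\xA0" is the no-break space U+00A0.
$text = "\xD0\xA0\xC2\xA0ok";

// Byte mode: every 0xA0 byte is replaced, which breaks U+0420.
$bytewise = preg_replace( '/\xA0/', ' ', $text );

// UTF-8 mode: only the code point U+00A0 is replaced.
$utf8_aware = preg_replace( '/\x{A0}/u', ' ', $text );

// No /u needed: match the complete two-byte sequence C2 A0 instead.
// 0xC2 is never a continuation byte, so in valid UTF-8 this cannot
// match in the middle of another character.
$no_u = preg_replace( '/\xC2\xA0/', ' ', $text );
```

Here $bytewise ends up as invalid UTF-8, while $utf8_aware and $no_u
both keep the Cyrillic letter intact and replace only the no-break
space.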
> Having a global to store this and replace the static currently used in
> wp_check_invalid_utf8() seems the proper approach. Don't see why we
> should run the test whether PCRE UTF-8 is available every time one of
> these functions is called.
That's right, I've been a bit shortsighted here. A variable can save a
decision for the length of a request, true. The options I have here are
(no specific order):
* native PHP code
* preg_ functions (enabled per default since 4.2.0)
* mb_ functions (non-default extension)
* iconv (non-default extension)
A static per function might do the best job, since depending on what is
to be done (validating, filtering) there might be different priorities
among the functions to use, and as long as this is at an early stage of
implementation I do not want to mess around with globals that much.
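
As a rough sketch of the static-per-function idea: the function name
example_utf8_backend() and the priority order below are assumptions for
illustration, not part of any patch. The PCRE probe mirrors the kind of
check discussed for wp_check_invalid_utf8().

```php
<?php
// Hypothetical helper: decides once per request which facility to use
// and remembers the decision in a function-local static.
function example_utf8_backend() {
	static $backend = null;

	if ( null !== $backend ) {
		return $backend; // Decision already made during this request.
	}

	// The priority order here is an assumption; a validating function
	// and a filtering function might each prefer a different order.
	if ( 1 === @preg_match( '/^./u', 'a' ) ) {
		$backend = 'preg';     // PCRE with working /u modifier.
	} elseif ( function_exists( 'mb_check_encoding' ) ) {
		$backend = 'mbstring'; // mb_ functions are loaded.
	} elseif ( function_exists( 'iconv' ) ) {
		$backend = 'iconv';    // iconv extension is loaded.
	} else {
		$backend = 'native';   // Fall back to plain PHP code.
	}

	return $backend;
}
```

The probe runs only on the first call; later calls in the same request
return the stored value, so no global is needed.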
--
Ticket URL: <http://core.trac.wordpress.org/ticket/11738#comment:12>
WordPress Trac <http://core.trac.wordpress.org/>
WordPress blogging software