[wp-trac] [WordPress Trac] #29717: wp_check_invalid_utf8 - pcre tricks and failsafes, +mb_convert_encoding, iconv fix, performance

WordPress Trac noreply at wordpress.org
Sat Sep 20 19:43:39 UTC 2014


#29717: wp_check_invalid_utf8 - pcre tricks and failsafes, +mb_convert_encoding,
iconv fix, performance
--------------------------------+------------------------------------------
 Reporter:  askapache           |       Owner:
     Type:  enhancement         |      Status:  new
 Priority:  normal              |   Milestone:  Awaiting Review
Component:  Formatting          |     Version:  trunk
 Severity:  normal              |  Resolution:
 Keywords:  has-patch dev-      |     Focuses:  administration, performance
  feedback                      |
--------------------------------+------------------------------------------

Comment (by askapache):

 Replying to [comment:2 miqrogroove]:
 > Impressive.  So the main benefits are 10% faster and more compatibility?
 > Are there any systems currently running WordPress that need this patch?  A
 > more concise, big picture description would help.
 >
 > Also, I learned in feedback from the 4.0 release that we need to
 > specifically test PHP versions less than 5.4.9 and 5.3.19, because they
 > exhibit crashes when PCRE is used to perform certain types of alternation
 > and backtracking.  I found that version 5.2.13 is particularly easy to
 > download.  It is not necessary to add unit tests for that, but we need to
 > see that if someone posts a 10kb or 100kb block of text that it won't
 > suddenly crash due to a server bug.

 The updates don't actually change the behaviour of this function unless:

 1. You have a site with an older PCRE lacking UTF-8 support, in which
 case those 4 functions will now correctly filter and check for invalid
 UTF-8.
 2. You use the `strip` parameter to actually remove invalid UTF-8 for a
 plugin or theme, in which case it will now work correctly.  That was a bug
 fix.

 Some folks have PCRE compiled without UTF-8 support, or with it disabled,
 so for them the '/u' modifier doesn't work, which results in essentially
 this entire check being skipped.
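
 A minimal sketch of how that situation can be detected, assuming nothing
 beyond `preg_match` itself.  `example_pcre_has_utf8` is a hypothetical
 helper name used only for illustration; it is not taken from core or from
 the patch:

```php
<?php
// Hypothetical helper (illustration only, not part of WordPress core).
function example_pcre_has_utf8() {
    static $supported = null;

    if ( null === $supported ) {
        // When PCRE was built without UTF-8 support, a pattern using the
        // /u modifier fails to compile: preg_match() returns false and
        // raises a warning, which the @ operator suppresses here.
        $supported = ( 1 === @preg_match( '/^./u', 'a' ) );
    }

    return $supported;
}
```

 The result is cached in a static variable, so the probe only runs once
 per request.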

 This is also somewhat of a security issue, as with the whole IDN domain
 issues and other UTF-8 exploits.  The big picture is to make the function
 easier to develop and use; it hadn't been updated for quite a while.  This
 should make it easier to update/extend/move this function down the road.
 I think some people may have wrongly assumed that it was doing more than
 it was.  It's kind of a strange function: it takes a string as input and
 either returns it as is, or returns a blank string in case of invalid
 UTF-8.  But that's actually really clever; it's much safer and faster
 that way, just not so clear.
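
 To make that contract concrete, here is a rough sketch of the
 "return as is, or return a blank string" behaviour.  It assumes PCRE was
 built with UTF-8 support, and `example_check_utf8` is a made-up name, not
 the real function or the patched version:

```php
<?php
// Hypothetical stand-in for the behaviour described above (not core code).
function example_check_utf8( $string ) {
    $string = (string) $string;

    if ( '' === $string ) {
        return '';
    }

    // With the /u modifier, PCRE validates the subject as UTF-8 before
    // matching.  A valid string matches and yields 1; an invalid one makes
    // preg_match() return false.
    // Assumption: PCRE has UTF-8 support.  Without it this also returns
    // false, so every non-empty string would be rejected.
    if ( 1 === @preg_match( '/^./us', $string ) ) {
        return $string;
    }

    return '';
}
```

 For example, `example_check_utf8( "h\xC3\xA9llo" )` hands the string back
 unchanged, while `example_check_utf8( "a\xFFb" )` gives an empty string.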

 I've noticed several plugins like Disqus and Yoast SEO have started to
 build their own incarnations of this function.  This update should help
 make clear what it is and isn't.

 I have tested on PHP 5.2, and I approached this with extreme caution to
 avoid causing any issues.  IOW, this function will also work on 5.0.  The
 only reason it wouldn't work for PHP 4.x is that `stripos` wasn't available
 as a builtin Zend function until 5.0, but I noticed it's already being used
 in several places in core.  (I am still used to having to code backwards
 for 4.x, so I'm happy that's officially over for WP.)

 The big change is the two new fallbacks to the original preg_match,
 including the custom regex, which will be the fallback for those with
 absolutely no UTF-8 PCRE capability.  It has to be rare for that to ever
 actually be needed, but that's the only place where I can see a possible
 buffer or memory problem.  preg_match isn't as efficient as a builtin
 function such as strpos, but it is pretty darn efficient.
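
 For illustration, a byte-level check that does not need '/u' can look
 roughly like the sketch below.  This is the widely circulated byte-range
 pattern for well-formed UTF-8 (per RFC 3629), not necessarily the regex in
 the patch, and `example_is_utf8_bytes` is a hypothetical name.  It has not
 been checked against very large inputs on old PCRE builds:

```php
<?php
// Hypothetical helper (illustration only, not the regex from the patch).
function example_is_utf8_bytes( $string ) {
    // No /u modifier: the pattern works on raw bytes, so it does not
    // depend on PCRE's own UTF-8 support.  The x modifier only allows the
    // whitespace used for layout.  The possessive *+ avoids backtracking
    // into sequences that have already been accepted.
    $pattern = '/\A(?:
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2-byte, no overlongs
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, no overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, no overlongs
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, up to U+10FFFF
    )*+\z/x';

    // preg_match() returns false on an internal error (for example a
    // backtrack limit); that is treated the same as invalid input.
    return 1 === @preg_match( $pattern, (string) $string );
}
```

 Treating a PCRE error as "invalid" keeps the failure mode on the safe
 side: at worst a valid string is rejected, never the other way round.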

 The other big change is making the `strip` parameter work.  Since it isn't
 actually being used anywhere in core, it seems to have been forgotten
 about a little.  With it now working, I will start using it in plugins and
 themes to sanitize UTF-8 (because this is super fast).  That's actually
 why I initially started on this.
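
 As a rough idea of what stripping can look like, the sketch below drops
 invalid bytes with `mb_convert_encoding` and falls back to `iconv`.  The
 function name is hypothetical and the patch may well do this differently.
 It also assumes the mbstring extension is loaded for the first branch, and
 `//IGNORE` is known to behave differently across iconv implementations:

```php
<?php
// Hypothetical helper (illustration only, not the code from the patch).
function example_strip_invalid_utf8( $string ) {
    $string = (string) $string;

    if ( function_exists( 'mb_convert_encoding' ) ) {
        // By default mbstring replaces invalid bytes with '?'.  Setting
        // the substitute character to 'none' drops them instead.
        $previous = mb_substitute_character();
        mb_substitute_character( 'none' );
        $clean = mb_convert_encoding( $string, 'UTF-8', 'UTF-8' );
        mb_substitute_character( $previous ); // restore the old setting
        return $clean;
    }

    if ( function_exists( 'iconv' ) ) {
        // Assumption: the platform's iconv honours //IGNORE.  On failure
        // iconv() returns false, which becomes an empty string here.
        return (string) @iconv( 'UTF-8', 'UTF-8//IGNORE', $string );
    }

    // No way to strip: fall back to rejecting the input.
    return '';
}
```

 Restoring the previous substitute character matters because the setting
 is global to the request and other code may rely on it.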

--
Ticket URL: <https://core.trac.wordpress.org/ticket/29717#comment:3>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

