[wp-trac] [WordPress Trac] #29717: wp_check_invalid_utf8 - pcre tricks and failsafes, +mb_convert_encoding, iconv fix, performance

WordPress Trac noreply at wordpress.org
Wed Apr 10 08:28:59 UTC 2019


#29717: wp_check_invalid_utf8 - pcre tricks and failsafes, +mb_convert_encoding,
iconv fix, performance
-------------------------------------------------+-------------------------
 Reporter:  askapache                            |       Owner:  (none)
     Type:  enhancement                          |      Status:  new
 Priority:  normal                               |   Milestone:  Awaiting
                                                 |  Review
Component:  Formatting                           |     Version:
 Severity:  normal                               |  Resolution:
 Keywords:  has-patch dev-feedback needs-        |     Focuses:
  refresh needs-unit-tests                       |  performance
-------------------------------------------------+-------------------------

Comment (by kitchin):

 Proof of concept, needs unit tests. Passes my ad-hoc testing with
 @askapache's test strings.

 1. Speed up testing for an empty string. Indeed, Stack Exchange says
 `0==strlen($string)` is slower than `isset($string[0])`. But `''==$string`
 is almost as fast and matches the WP codebase.

 2. For stripping, iconv() misses some patterns in the test strings, on my
 platform at least. But the bytewise regex in `wpdb::strip_invalid_text()`
 finds them all (4 byte version). So use that.

 3. Add a new parameter $bytewise that controls use of the regex from wpdb.
 "Bytewise" here means without using "/u".
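
 As a rough illustration of points 1 to 3, the sketch below strips with a
 bytewise pattern. It is not the patch itself: the function names are
 hypothetical, and the pattern is written from the well-formed UTF-8 byte
 ranges (4-byte version), in the same spirit as the regex in
 `wpdb::strip_invalid_text()`, not copied from it.

```php
<?php
// Sketch only: the function names are hypothetical, not from the patch.

// One well-formed UTF-8 sequence, written as byte ranges (no "/u" needed).
function utf8_sequence_pattern(): string {
    return '(?:[\x00-\x7F]'                   // 1 byte: ASCII
        . '|[\xC2-\xDF][\x80-\xBF]'           // 2 bytes, no overlongs
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'       // 3 bytes, no overlongs
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'       // 3 bytes, no surrogates
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'    // 4 bytes, no overlongs
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2})';  // 4 bytes, up to U+10FFFF
}

// Keep runs of valid sequences and drop every other byte.
function strip_invalid_utf8_bytewise(string $string): string {
    if ('' === $string) {
        return ''; // cheap empty-string test, as in point 1
    }
    // Runs are capped at 40 sequences so the match proceeds in small chunks.
    $regex  = '/(' . utf8_sequence_pattern() . '{1,40})|./';
    $result = preg_replace($regex, '$1', $string);
    // preg_replace() returns null on error; always hand back a string.
    return is_string($result) ? $result : '';
}

// A string is valid if stripping leaves it unchanged.
function is_valid_utf8_bytewise(string $string): bool {
    return strip_invalid_utf8_bytewise($string) === $string;
}
```

 Checking validity by comparing against the stripped string keeps the
 sketch to a single pattern; a real implementation might prefer a
 dedicated match-only pattern for speed.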

 The new parameter (set to 'always') should solve #38044 by providing a
 better check than `seems_utf8()`.

 By default the patch works the same as trunk, when $strip is off. For
 $strip the patch uses the wpdb regex instead of `iconv()`. Note there's a
 slight bug in trunk since the return can be null instead of string if
 `iconv()` fails, and also `iconv()` should be `@iconv()`.
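
 To make the trunk issue concrete, here is a minimal sketch of a guarded
 call. The name iconv_strip_or_empty() is hypothetical, and returning ''
 when stripping is impossible is an assumption of the sketch. The PHP
 manual documents `iconv()` as returning false on failure; either way the
 caller can receive a non-string unless the result is checked.

```php
<?php
// Sketch only: iconv_strip_or_empty() is a hypothetical name.
function iconv_strip_or_empty(string $string): string {
    if (!function_exists('iconv')) {
        return ''; // assumption: no iconv means nothing can be stripped
    }
    // "@" silences the notice iconv() raises on an illegal byte sequence.
    $result = @iconv('UTF-8', 'UTF-8', $string);
    // Normalise a failed conversion so the return type is always string.
    return is_string($result) ? $result : '';
}
```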

 Compared to @askapache's 29757.5.patch, this patch does not try to use
 '*UTF8' or `htmlspecialchars()` as fallbacks. The wpdb regex may be
 slower, but it's only used when "/u" is not available, or for the "not
 recommended" strip. It's five years later now, so platforms are better,
 and "not recommended" has been in the codebase longer than that.
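
 For reference, a sketch of how "/u" availability can be probed and used
 for the check. The function names are hypothetical, and returning null
 to signal "cannot tell, use the bytewise regex" is an assumption of this
 sketch, not the patch's interface.

```php
<?php
// Sketch only: both function names are hypothetical.

// Probe once whether this PCRE build accepts the "/u" modifier.
function pcre_utf8_available(): bool {
    static $available = null;
    if (null === $available) {
        $available = (1 === @preg_match('/^./u', 'a'));
    }
    return $available;
}

// true/false when "/u" can decide; null when a fallback is needed.
function is_valid_utf8_pcre(string $string): ?bool {
    if ('' === $string) {
        return true;
    }
    if (!pcre_utf8_available()) {
        return null;
    }
    // With "/u", preg_match() fails on a subject that is not valid UTF-8.
    return 1 === @preg_match('/^./us', $string);
}
```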

 Note the code patched has not changed logically since WP 4.0, approx. when
 this bug started.

 (I'm going to post an updated patch that fixes a bug.)

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/29717#comment:24>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

