[wp-trac] [WordPress Trac] #22363: Accents in attachment filenames should be sanitized

Sat Sep 3 22:37:48 UTC 2016

#22363: Accents in attachment filenames should be sanitized
-------------------------------------+------------------
 Reporter:  tar.gz                   |       Owner:
     Type:  defect (bug)             |      Status:  new
 Priority:  normal                   |   Milestone:  4.7
Component:  Permalinks               |     Version:  3.4
 Severity:  critical                 |  Resolution:
 Keywords:  needs-testing has-patch  |     Focuses:
-------------------------------------+------------------

Comment (by gitlost):

 I think there's a number of problems with the latest proposed patch.

 The encoding of the filename can only be guessed at as far as I know -
 it's just a string of bytes with no encoding info from the client and is
 not connected to the page's (or blog's) charset, so there should be no
 defaulting to the blog's charset. The default should just be nothing.

 It's not clear to me why `wp_strip_all_tags()` and `html_entity_decode()`
 and entity stripping are being done, as opposed to the current solution of
 just stripping angle brackets and ampersands via `special_chars`, which
 seems to be as effective and is less drastic.

 A small point is that the conversion of the `special_chars` filter to a
 concatenated character class regex is not strictly backward compatible, as
 people could have been doing multi-byte stripping, which would fail now.

 The use of the `$utf8_modifier` is only needed when the regex contains a
 multi-byte pattern, and there's only 2 places in the proposed patch where
 that is (intentionally) the case, one of which is problematic `[\s-]+` and
 the other arguable `(?!\.)[^\p{L}\p{Nd}]+`. A third use on
 `^[a-zA-Z]{2,5}\d?$` looks unintentional as `\d` will match `\p{Nd}` if
 UTF-8 mode is set and PCRE has
 [http://php.net/manual/en/regexp.reference.unicode.php UCP] support (which
 unfortunately are not necessarily the same thing - see #22692#comment:39).
 Other uses are unneeded.

 The `[\s-]+` use is problematic and as it stands could actually re-
 introduce the bug that corrupts UTF-8 that was fixed up above (see #26094)
 if the filename is in UTF-8 and PCRE UTF-8 mode is unavailable (as
 `$utf9_modifier` would then be blank). Even if that's fixed it still has
 issues due to the difficulty in predicting what it will match - in single-
 byte mode it's PHP locale dependent, and in UTF-8 mode it's UCP dependent.
 Its introduction appears to be driven by wanting to encompass both the
 `U+00A0` substitution (see #27281 but it's not particularly clear from
 that bug report whether the fix was appropriate) and the simple
 substitution `[\r\n\t -]+`. I think it would be better and more targeted
 to leave the current `[\r\n\t -]+` as is, and to replace the current
 `preg_match( "#\x{00a0}#siu", ' ', $filename )` with a straightforward
 `str_replace( "\xc2\xa0", ' ', $filename  )` (to be done only when the
 filename is in UTF-8).

 The `(?!\.)[^\p{L}\p{Nd}]+` use should only be done if the filename is in
 UTF-8, and also technically should check that UCP support is available, as
 a previous patch
 [https://core.trac.wordpress.org/attachment/ticket/22363/22363.5.patch
 22363.5.patch] above does. This is the substantive change that fixes up
 the bug in `remove_accents()`, and works. However I think the bug in
 `remove_accents()` - which could be renamed
 `transliterate_commonly_used_latin_to_ascii()` - should be fixed in
 `remove_accents()`, as specified in the branched off bug #24661. Also to
 be picky(!) it removes too much, not just decomposed latin accents. The
 solutions mentioned in #24661 are more targeted.

 The use of `mb_strtolower()` should probably also only be done if the
 filename is in UTF-8, and then explicitly pass `UTF-8` as the encoding
 (see `sanitize_title_with_dashes()`), rather than using
 `mb_detect_encoding()`, which has [http://php.net/manual/en/function.mb-
 detect-encoding.php#102510 issues].

--
Ticket URL: <https://core.trac.wordpress.org/ticket/22363#comment:79>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform