[wp-trac] [WordPress Trac] #22363: Accents in attachment filenames should be sanitized
WordPress Trac
noreply at wordpress.org
Sat Sep 3 22:37:48 UTC 2016
#22363: Accents in attachment filenames should be sanitized
-------------------------------------+------------------
Reporter: tar.gz | Owner:
Type: defect (bug) | Status: new
Priority: normal | Milestone: 4.7
Component: Permalinks | Version: 3.4
Severity: critical | Resolution:
Keywords: needs-testing has-patch | Focuses:
-------------------------------------+------------------
Comment (by gitlost):
I think there's a number of problems with the latest proposed patch.
The encoding of the filename can only be guessed at as far as I know -
it's just a string of bytes with no encoding info from the client and is
not connected to the page's (or blog's) charset, so there should be no
defaulting to the blog's charset. The default should just be nothing.
It's not clear to me why `wp_strip_all_tags()` and `html_entity_decode()`
and entity stripping are being done, as opposed to the current solution of
just stripping angle brackets and ampersands via `special_chars`, which
seems to be as effective and is less drastic.
A small point is that the conversion of the `special_chars` filter to a
concatenated character class regex is not strictly backward compatible, as
people could have been doing multi-byte stripping, which would fail now.
The use of the `$utf8_modifier` is only needed when the regex contains a
multi-byte pattern, and there's only 2 places in the proposed patch where
that is (intentionally) the case, one of which is problematic `[\s-]+` and
the other arguable `(?!\.)[^\p{L}\p{Nd}]+`. A third use on
`^[a-zA-Z]{2,5}\d?$` looks unintentional as `\d` will match `\p{Nd}` if
UTF-8 mode is set and PCRE has
[http://php.net/manual/en/regexp.reference.unicode.php UCP] support (which
unfortunately are not necessarily the same thing - see #22692#comment:39).
Other uses are unneeded.
The `[\s-]+` use is problematic and as it stands could actually re-
introduce the bug that corrupts UTF-8 that was fixed up above (see #26094)
if the filename is in UTF-8 and PCRE UTF-8 mode is unavailable (as
`$utf9_modifier` would then be blank). Even if that's fixed it still has
issues due to the difficulty in predicting what it will match - in single-
byte mode it's PHP locale dependent, and in UTF-8 mode it's UCP dependent.
Its introduction appears to be driven by wanting to encompass both the
`U+00A0` substitution (see #27281 but it's not particularly clear from
that bug report whether the fix was appropriate) and the simple
substitution `[\r\n\t -]+`. I think it would be better and more targeted
to leave the current `[\r\n\t -]+` as is, and to replace the current
`preg_match( "#\x{00a0}#siu", ' ', $filename )` with a straightforward
`str_replace( "\xc2\xa0", ' ', $filename )` (to be done only when the
filename is in UTF-8).
The `(?!\.)[^\p{L}\p{Nd}]+` use should only be done if the filename is in
UTF-8, and also technically should check that UCP support is available, as
a previous patch
[https://core.trac.wordpress.org/attachment/ticket/22363/22363.5.patch
22363.5.patch] above does. This is the substantive change that fixes up
the bug in `remove_accents()`, and works. However I think the bug in
`remove_accents()` - which could be renamed
`transliterate_commonly_used_latin_to_ascii()` - should be fixed in
`remove_accents()`, as specified in the branched off bug #24661. Also to
be picky(!) it removes too much, not just decomposed latin accents. The
solutions mentioned in #24661 are more targeted.
The use of `mb_strtolower()` should probably also only be done if the
filename is in UTF-8, and then explicitly pass `UTF-8` as the encoding
(see `sanitize_title_with_dashes()`), rather than using
`mb_detect_encoding()`, which has [http://php.net/manual/en/function.mb-
detect-encoding.php#102510 issues].
--
Ticket URL: <https://core.trac.wordpress.org/ticket/22363#comment:79>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform
More information about the wp-trac
mailing list