[wp-trac] [WordPress Trac] #50855: sanitize_file_name function not working as expected if there is '%20' in filename

WordPress Trac noreply at wordpress.org
Wed Mar 30 15:16:08 UTC 2022


#50855: sanitize_file_name function not working as expected if there is '%20' in
filename
-------------------------------------------------+-------------------------
 Reporter:  dishitpala                           |       Owner:  audrasjb
     Type:  defect (bug)                         |      Status:  accepted
 Priority:  normal                               |   Milestone:  6.0
Component:  Formatting                           |     Version:  5.4.2
 Severity:  normal                               |  Resolution:
 Keywords:  has-patch has-screenshots has-unit-  |     Focuses:
  tests dev-feedback                             |
-------------------------------------------------+-------------------------

Comment (by craigfrancis):

 I'm wondering if
 [https://core.trac.wordpress.org/attachment/ticket/50855/50855.2.diff
 50855.2.diff] is the correct approach, by simply moving the current
 `array( '%20', '+' )` up 1 line, so both ways of URL encoding a space are
 handled before `$special_chars`.

 As to the failing `wp_unique_filename()` test, that is:

 {{{
 wp_unique_filename( $testdir, '12%af34567890#~!@#$..%^&*()|_+qwerty
 fgh`jkl zx<>?:"{}[]="\'/?.png' )

 -'12af34567890 at ..^_qwerty-fghjkl-zx.png'
 +'12af34567890 at ..^_-qwerty-fghjkl-zx.png'
 }}}

 The extra `'-'` is coming from the `'+'` before `qwerty`, so maybe just
 update the expected test result?

 Tonya, I agree this could be a breaking change (e.g. while this usually
 has other issues as well, a developer could save files using
 `sanitize_file_name()` and store the original file names in their
 database)... but there has kinda been a breaking change already, where the
 feature that replaces `'%20'` and `'+'` so they became `'-'` was broken (I
 presume WordPress wanted to visually separate words in a file name, even
 when the name was URL encoded).

 An alternative would be to simply remove that line, and don't handle the
 URL encoding of spaces.

 ---

 Tangent stuff...

 Tonya, thanks for finding [https://core.trac.wordpress.org/changeset/35122
 Changeset 35122], the addition of `'%'` and `'+'` in `$special_chars` via
 [https://core.trac.wordpress.org/changeset/35122 Changeset 35122] did
 break this feature (oddly `test_replace_spaces()` explicitly tests `'multi
 %20 +space.png' => 'multi-20-space.png'`).

 For the [https://github.com/WordPress/wordpress-develop/pull/1873 patch
 from Tonya], the `'%20'` replacement happens before `$special_chars` on
 [https://github.com/WordPress/wordpress-
 develop/blob/7120c9cda9c9994e54d3432b967a85bc2c383025/src/wp-
 includes/formatting.php#L2020 line 2020]; but keeps the `'+'` replacement
 on [https://github.com/WordPress/wordpress-
 develop/blob/7120c9cda9c9994e54d3432b967a85bc2c383025/src/wp-
 includes/formatting.php#L2022 line 2022], I where don't think that will do
 anything (Peter has come to the same conclusion).

 I'm going to trust the `$special_chars` is complete (I'm not sure if there
 are other problematic characters)... but either way, they can be replaced
 by a hyphen or be removed. Both replacement methods are safe, and I'm
 pretty sure it's just personal taste on which one to use (e.g.
 "MyFile(1).jpg", should that be "MyFile1.jpg" or "MyFile-1.jpg"?). For now
 I'd prefer to stay with `'%20'` and `'+'`, with a separate discussion on
 what other characters could be replaced by a hyphen.

 To answer your questions, I think the `'+'` should be replaced with a
 `'-'`, because that seems to be the intended and original behaviour (why
 the `array( '%20', '+' )` was added in the first place). Secondly, I don't
 think leaving `'20'` was by design (a half encoded value), and I cannot
 think of any issue with replacing `'%20'` with a `'-'` in a function that
 intentionally removes bad characters from a file name.

 With PHP, the `urlencode()` function follows RFC 1738 and its use of URI
 defined in RFC 1630 (ref
 [https://datatracker.ietf.org/doc/html/rfc1630#page-7 page 7], which says
 "Within the query string, the plus sign is reserved as shorthand notation
 for a space."), this is why a space is often encoded a space to `'+'`
 ([https://github.com/php/php-
 src/blob/48e07076209c6163ecb4a756f73f511f1f09200e/ext/standard/url.c#L536
 source]), whereas `rawurlencode()` follows RFC 3986 and encodes a space to
 `'%20'` (i.e. it drops the special case). In both cases we are assuming
 the file name has been URL encoded.

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/50855#comment:14>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform


More information about the wp-trac mailing list