[wp-trac] [WordPress Trac] #22363: Accents in attachment filenames should be sanitized

WordPress Trac noreply at wordpress.org
Thu Nov 14 20:23:53 UTC 2013


#22363: Accents in attachment filenames should be sanitized
----------------------------------------+------------------
 Reporter:  tar.gz                      |       Owner:
     Type:  defect (bug)                |      Status:  new
 Priority:  normal                      |   Milestone:  3.8
Component:  Upload                      |     Version:  3.4
 Severity:  normal                      |  Resolution:
 Keywords:  has-patch needs-unit-tests  |
----------------------------------------+------------------

Comment (by p_enrique):

 I've been experimenting with this issue. Now, I know I'm repeating most of
 the things said above, but I'd like to confirm the following:

 1. As for filenames:
   - Non-ASCII filenames will break on Windows. This is fundamentally a PHP
 bug.
   - *nix filesystems allow any characters in filenames (minus the reserved
 characters, of course).
   - WordPress allows uploaded files to have any characters in their names
 (minus some reserved characters).
   - UTF-8 filenames work OK on WordPress on a *nix platform.
   - Percent-encoded filenames work on any platform, but make the filename
 6 times longer (counted in characters) in scripts such as Cyrillic: the
 single character 'ж' becomes the six characters '%D0%B6', for example.
   - remove_accents() is only a partial solution, since 1) it has no support
 at all for non-Latin scripts and 2) it does nothing about characters such
 as curly quotes, dashes or copyright symbols (which
 sanitize_title_with_dashes() does handle).

 2. As for URLs:
   - The characters allowed unencoded in URLs are ''only alphanumerics,
 the special characters "$-_.+!*'(),", and reserved characters used for
 their reserved purposes''. That wording comes from the older RFC1738;
 [http://tools.ietf.org/rfc/rfc3986.txt RFC3986] itself lists only
 alphanumerics and "-._~" as unreserved. Everything else should be
 percent-encoded.
   - The attachment anchor href and img src attributes are '''not'''
 encoded on WP.
   - Not encoding the above may break things in some browsers.
   - Not encoding a URI may break other software that recognizes pasted
 URIs.
   - An encoded anchor href is handled transparently by modern browsers.
 For example, see [http://ru.wikipedia.org the Russian Wikipedia], where
 hovering on a link shows Cyrillic text but in the source, the URI is
 %-encoded.
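
 As a sketch of what encoding the href/src value could look like, the
 following Python snippet percent-encodes only the path of an attachment
 URL. It is an illustration under stated assumptions, not how WordPress
 does or should do it: encode_url_path() is a hypothetical helper and the
 example.com URL is made up.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def encode_url_path(url: str) -> str:
    """Percent-encode the path of a URL, leaving the other parts alone.

    Hypothetical helper. Simplification: the path is unquoted first so that
    an already-encoded URL is not encoded twice; this also means an encoded
    slash (%2F) in the input would be turned back into a literal "/".
    """
    parts = urlsplit(url)
    path = quote(unquote(parts.path), safe="/")
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


# Made-up attachment URL with a Cyrillic filename.
url = "http://example.com/wp-content/uploads/2013/11/\u0436\u0443\u043a.jpg"
encoded = encode_url_path(url)

print(encoded)
# http://example.com/wp-content/uploads/2013/11/%D0%B6%D1%83%D0%BA.jpg

# Encoding is reversible and safe to apply more than once.
print(unquote(encoded) == url)              # True
print(encode_url_path(encoded) == encoded)  # True
```

 The encoded form is what would go into the markup; a browser decodes it
 for display, which matches the Wikipedia behaviour described above.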

--
Ticket URL: <http://core.trac.wordpress.org/ticket/22363#comment:32>
WordPress Trac <http://core.trac.wordpress.org/>
WordPress blogging software

