[wp-trac] [WordPress Trac] #6697: Percent Encoding in URL should be capital alphabets instead of small letters

WordPress Trac wp-trac at lists.automattic.com
Sat Apr 12 13:23:42 GMT 2008


#6697: Percent Encoding in URL should be capital alphabets instead of small
letters
---------------------+------------------------------------------------------
 Reporter:  akky     |       Owner:  anonymous                
     Type:  defect   |      Status:  new                      
 Priority:  normal   |   Milestone:  2.5.1                    
Component:  General  |     Version:  2.5                      
 Severity:  major    |    Keywords:  UTF-8, URL, normalization
---------------------+------------------------------------------------------
 [http://trac.wordpress.org/browser/trunk/wp-includes/formatting.php
 wp-includes/formatting.php] utf8_uri_encode()

 When you select [Options]-[Permalinks]-[Common Options]-[Date and name
 based], or use %postname% in a custom URL structure, the entry title is
 normalized so that it contains only characters permitted in URLs.

 If the title contains non-ASCII characters, they cannot be placed directly
 in the URL, so they are percent-encoded. This is done in
 sanitize_title_with_dashes_original() and utf8_uri_encode().

 The problem is that these two functions normalize too aggressively: they
 also convert the hexadecimal digits of the percent-encoding to lowercase.

 For example, in the unit test data, "Zhang Ziyi" (written in Chinese) is
 currently converted to "%e7%ab%a0%e5%ad%90%e6%80%a1", whereas it
 '''should''' be "%E7%AB%A0%E5%AD%90%E6%80%A1".

 Currently, WordPress generates the title part of the URL by URL-encoding
 the title and then applying strtolower(), so everything becomes lowercase.
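
 A minimal sketch of this behavior, written in Python rather than
 WordPress's PHP, and assuming only the "encode, then lowercase" order
 described above. Here urllib's quote() stands in for utf8_uri_encode() and
 str.lower() for strtolower(); it is an illustration, not the actual
 WordPress code.

```python
from urllib.parse import quote

# "Zhang Ziyi" in Chinese, written with escapes to keep this snippet ASCII.
title = "\u7ae0\u5b50\u6021"

# Percent-encode the UTF-8 bytes; quote() emits uppercase hex digits.
encoded = quote(title, safe="")

# Lowercasing the whole string afterwards also lowercases the hex digits.
lowered = encoded.lower()

print(encoded)  # %E7%AB%A0%E5%AD%90%E6%80%A1
print(lowered)  # %e7%ab%a0%e5%ad%90%e6%80%a1
```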

 The reason percent-encoding should use uppercase letters is given in
 [http://rfc.net/rfc3986.html#s2.1. RFC3986 section 2.1]:

 {{{
 The uppercase hexadecimal digits 'A' through 'F' are equivalent to the
 lowercase digits 'a' through 'f', respectively.  If two URIs differ only
 in the case of hexadecimal digits used in percent-encoded octets, they are
 equivalent.  '''For consistency, URI producers and normalizers should use
 uppercase hexadecimal digits for all percent-encodings.'''
 }}}
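
 To illustrate the quoted rule, here is a small Python sketch of a
 normalizer that uppercases only the hex digits of percent-encoded octets.
 The function name normalize_percent_case and the example.com URLs are made
 up for this example; a consumer that compares URLs as plain strings sees
 two different addresses until both are normalized.

```python
import re

# Matches one percent-encoded octet, e.g. "%e7" or "%E7".
_PCT_OCTET = re.compile(r"%[0-9A-Fa-f]{2}")

def normalize_percent_case(uri: str) -> str:
    """Uppercase the hex digits of every percent-encoded octet, nothing else."""
    return _PCT_OCTET.sub(lambda m: m.group(0).upper(), uri)

# Hypothetical permalinks that differ only in the case of the hex digits.
a = "http://example.com/2008/04/%e7%ab%a0%e5%ad%90%e6%80%a1/"
b = "http://example.com/2008/04/%E7%AB%A0%E5%AD%90%E6%80%A1/"

print(a == b)                                                  # False
print(normalize_percent_case(a) == normalize_percent_case(b))  # True
```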

 As [http://rfc.net/rfc2119.html#s3. RFC2119 section 3] explains, "should"
 means the recommendation may be ignored for a valid reason. So if the
 current behavior is a deliberate WordPress policy, I do not think I can
 insist on the fix.

 ----

 Now, let me explain why I and other Japanese WordPress users need this
 fix. In the Japanese blogosphere there is a social bookmarking service,
 'Hatena Bookmark', which is the Japanese equivalent of del.icio.us.

 This service accepts users' bookmarks in several different ways, such as a
 bookmarklet, a widget, the site itself, and user-made tools. Some of them
 use the WordPress URL as is, while others normalize it to uppercase.

 As a result, the bookmarks for a blog entry written on WordPress tend to
 be split across several URL variants. Compared with blogs on other blog
 systems, this keeps such entries from becoming popular items. (Search
 engines recognize encoded titles in URLs, so using this permalink option
 is necessary for Japanese users.)

 If WordPress always generated uppercase percent-encoding (but not for the
 roman alphabet parts, which are better left lowercase), this problem would
 be solved, and we could encourage more potential users to migrate to
 WordPress.
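
 One possible way to get that result is to lowercase the title before
 percent-encoding it instead of after. The Python sketch below shows the
 idea only; make_slug is a hypothetical helper, not a WordPress function,
 and it skips the punctuation handling a real slug generator would need.

```python
import re
from urllib.parse import quote

def make_slug(title: str) -> str:
    """Hypothetical slug builder: lowercase first, percent-encode last."""
    slug = title.strip().lower()       # roman letters become lowercase here
    slug = re.sub(r"\s+", "-", slug)   # each run of whitespace becomes one dash
    return quote(slug, safe="-")       # non-ASCII bytes become uppercase %XX

print(make_slug("Zhang Ziyi \u7ae0\u5b50\u6021"))
# zhang-ziyi-%E7%AB%A0%E5%AD%90%E6%80%A1
```

 Because the lowercasing happens before encoding, the hex digits produced
 by quote() are never touched by it.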

 I do not know whether similar issues exist for other non-ASCII languages.
 Fixing this may also cause backward incompatibility with previously
 generated entry URLs. However, this is a really big problem for us
 Japanese users. It is more of a WordPress marketing issue than a matter of
 RFC conformance.

-- 
Ticket URL: <http://trac.wordpress.org/ticket/6697>
WordPress Trac <http://trac.wordpress.org/>
WordPress blogging software

