[wp-trac] [WordPress Trac] #6697: Percent Encoding in URL should be
capital alphabets instead of small letters
WordPress Trac
wp-trac at lists.automattic.com
Sat Apr 12 13:23:42 GMT 2008
#6697: Percent Encoding in URL should be capital alphabets instead of small
letters
---------------------+------------------------------------------------------
 Reporter:  akky     |       Owner:  anonymous
     Type:  defect   |      Status:  new
 Priority:  normal   |   Milestone:  2.5.1
Component:  General  |     Version:  2.5
 Severity:  major    |    Keywords:  UTF-8, URL, normalization
---------------------+------------------------------------------------------
[http://trac.wordpress.org/browser/trunk/wp-includes/formatting.php wp-includes/formatting.php] utf8_uri_encode()
When you select [Options]-[Permalinks]-[Common Options]-[Date and name
based], or use %postname% in a custom permalink structure, the entry title
is normalized so that it contains only characters permitted in a URL.
If the title contains non-ASCII letters, they cannot be put in the URL
directly, so they are percent-encoded. This is done in
sanitize_title_with_dashes_original() and utf8_uri_encode().
The problem is that these two functions normalize too aggressively: the
hexadecimal digits of the percent-encoding are converted to lowercase as
well.
For example, in the unit test data, "Zhang Ziyi" (in Chinese) is currently
converted to "%e7%ab%a0%e5%ad%90%e6%80%a1", but it '''should''' be
"%E7%AB%A0%E5%AD%90%E6%80%A1".
Currently, WordPress generates the title part of the URL by URL-encoding
the title and then applying strtolower(), so everything ends up in
lowercase.
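To make the effect concrete, here is a minimal sketch in Python (not the WordPress PHP code itself). It assumes the title is the three characters 章子怡 and uses Python's standard `urllib.parse.quote` as a stand-in for the encoding step; the final `.lower()` call plays the role of strtolower().

```python
from urllib.parse import quote

# Assumed test title: "Zhang Ziyi" written in Chinese characters.
title = "章子怡"

# Percent-encode the UTF-8 bytes of the title.
# Python's quote() emits uppercase hexadecimal digits.
encoded = quote(title, safe="")

# Lowercasing the whole slug afterwards also lowercases the hex digits,
# which mirrors the effect of applying strtolower() after encoding.
lowered = encoded.lower()

print(encoded)  # %E7%AB%A0%E5%AD%90%E6%80%A1
print(lowered)  # %e7%ab%a0%e5%ad%90%e6%80%a1
```

The two printed strings identify the same resource under RFC 3986, but they are different when compared as plain strings.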
The reason uppercase letters are preferred in percent-encoding is given in
[http://rfc.net/rfc3986.html#s2.1. RFC3986 section 2.1]:
{{{
The uppercase hexadecimal digits 'A' through 'F' are equivalent to the
lowercase digits 'a' through 'f', respectively. If two URIs differ only
in the case of hexadecimal digits used in percent-encoded octets, they are
equivalent. '''For consistency, URI producers and normalizers should use
uppercase hexadecimal digits for all percent-encodings.'''
}}}
As [http://rfc.net/rfc2119.html#s3. RFC2119 section 3] explains, "should"
means the recommendation may be ignored when there is a valid reason. So if
the lowercase output is a deliberate WordPress policy, I cannot insist on
this fix.
----
Now, let me explain why I and other Japanese WordPress users need this
fix. In the Japanese blogosphere there is a social bookmark service,
'Hatena Bookmark', the Japanese equivalent of del.icio.us.
This service accepts users' bookmarks in several different ways, such as a
bookmarklet, a widget, the site itself, and user-made tools. Some of them
use the WordPress URL as is, whilst others normalize it to uppercase.
As a result, the bookmarks for a blog entry written on WordPress are split
between the lowercase and uppercase forms of the same URL, so each form
shows a smaller count. This keeps WordPress entries from becoming popular
items, compared with entries on other blog systems. (Search engines
recognize encoded titles in URLs, so Japanese users need this permalink
option.)
If WordPress always generated uppercase percent-encoding (but not for the
roman alphabet parts, which are better left in lowercase), this problem
would be solved, and we could encourage more potential users to migrate to
WordPress.
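One way to get that result is to uppercase only the percent-encoded triplets and leave every other character alone. The following is a minimal sketch in Python, not a patch for formatting.php; the function name `uppercase_percent_escapes` and the sample slugs are hypothetical and used only for illustration.

```python
import re

# Matches one percent-encoded octet: "%" followed by exactly two hex digits.
_PERCENT_ESCAPE = re.compile(r"%[0-9a-fA-F]{2}")


def uppercase_percent_escapes(slug: str) -> str:
    """Uppercase the hex digits of percent-escapes; leave other text as is."""
    return _PERCENT_ESCAPE.sub(lambda match: match.group(0).upper(), slug)


# Hypothetical slug that mixes roman letters with an encoded title.
mixed = "my-post-%e7%ab%a0%e5%ad%90%e6%80%a1"
print(uppercase_percent_escapes(mixed))
```

Because the pattern only matches a "%" followed by two hexadecimal digits, roman alphabet parts such as "my-post" stay in lowercase, and a slug that is already uppercase-encoded passes through unchanged.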
I do not know whether there are similar issues in other non-ASCII
languages. Fixing this may also cause backward incompatibility with the
URLs of existing entries. However, this is a really big problem for us
Japanese users. It is more a WordPress marketing issue than a matter of RFC
validity.
--
Ticket URL: <http://trac.wordpress.org/ticket/6697>
WordPress Trac <http://trac.wordpress.org/>
WordPress blogging software