[wp-hackers] Post slug bug
Ryan Boren
ryan at boren.nu
Sat Jun 2 02:39:47 GMT 2007
On 6/1/07, Dougal Campbell <dougal at gunters.org> wrote:
>
> Do we have an easy way to know which non-ASCII characters are
> non-alphanumeric when we start talking about UTF characters from
> non-English languages? How do we distinguish "word" characters from
> punctuation when we start dealing with Russian, Chinese, Hebrew, etc? If
> there isn't a reliable function for that already (I'm not sure, myself,
> I'll leave a definitive answer up to someone with more i18n experience),
> then the url-encoding mentioned here is still the best way to handle it,
> I think.
>
For UTF-8 strings, we first decompose accented characters in the Latin-1
Supplement and Latin Extended-A blocks into their unaccented forms. If
mb_strtolower is available, that is run next. Then all characters outside
the ASCII range are converted to percent-encoded octets. Then we strip any
remaining non-ASCII bits. The call sequence is like this:
  sanitize_title
    sanitize_title_with_dashes
      remove_accents
      utf8_uri_encode
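
To make the order of those steps concrete, here is a rough Python sketch of the pipeline. It is not the WordPress PHP code. The function names are borrowed only for readability, accent removal is approximated with Unicode decomposition rather than a lookup table (an assumption), and the final cleanup regexes are a simplification.

```python
import re
import unicodedata


def remove_accents(text: str) -> str:
    """Drop accents from Latin-1 Supplement / Latin Extended-A letters.

    Assumption: NFKD decomposition stands in for a fixed lookup table,
    so letters without a decomposition (e.g. U+00DF) are left as-is.
    """
    out = []
    for ch in text:
        if "\u00c0" <= ch <= "\u017f":
            decomposed = unicodedata.normalize("NFKD", ch)
            ch = "".join(c for c in decomposed if not unicodedata.combining(c))
        out.append(ch)
    return "".join(out)


def utf8_uri_encode(text: str) -> str:
    """Percent-encode every non-ASCII character as its UTF-8 octets."""
    return "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{b:02x}" for b in ch.encode("utf-8"))
        for ch in text
    )


def sanitize_title_with_dashes(title: str) -> str:
    title = remove_accents(title)       # 1. strip accents
    title = title.lower()               # 2. stands in for mb_strtolower
    title = utf8_uri_encode(title)      # 3. encode what is still non-ASCII
    title = re.sub(r"[^%a-z0-9 _-]", "", title)  # 4. strip everything else
    title = re.sub(r"\s+", "-", title)  # spaces become dashes
    title = re.sub(r"-+", "-", title)   # collapse runs of dashes
    return title.strip("-")


print(sanitize_title_with_dashes("Caf\u00e9 Ol\u00e9"))  # cafe-ole
print(sanitize_title_with_dashes("\u00bfQu\u00e9?"))     # %c2%bfque
```

In this sketch "¿" (U+00BF) sits just below the range that gets accent handling, so it survives to the encoding step and comes out as %c2%bf, while the ASCII "?" is simply stripped.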
Non-UTF8 strings don't get the uri encoding. If someone switched from
ISO-8859-1 to UTF-8, that would explain why "¿" is being encoded instead of
stripped.
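
As a rough illustration of that difference, the Python sketch below runs the same bytes-level idea on both encodings. It is only a sketch under assumptions: seems_utf8 here is a stand-in that just tries a strict decode, slug_bytes is a hypothetical helper, and accent handling is left out to keep the branch visible.

```python
import re


def seems_utf8(raw: bytes) -> bool:
    """Stand-in check: True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def slug_bytes(raw: bytes) -> str:
    """Hypothetical helper: build a slug from raw title bytes."""
    if seems_utf8(raw):
        # UTF-8 input: non-ASCII bytes survive as percent-encoded octets.
        text = "".join(chr(b) if b < 0x80 else f"%{b:02x}" for b in raw)
    else:
        # Other encodings: no uri encoding, so high bytes are stripped below.
        text = raw.decode("latin-1")
    text = text.lower()
    text = re.sub(r"[^%a-z0-9 _-]", "", text)
    text = re.sub(r"\s+", "-", text)
    return re.sub(r"-+", "-", text).strip("-")


title = "\u00bfHola?"
print(slug_bytes(title.encode("latin-1")))  # hola
print(slug_bytes(title.encode("utf-8")))    # %c2%bfhola
```

The same title gives "hola" when the bytes are ISO-8859-1 and "%c2%bfhola" when they are UTF-8, which is the behaviour change described above.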
Ryan