[wp-hackers] Post slug bug

Sat Jun 2 02:39:47 GMT 2007

On 6/1/07, Dougal Campbell <dougal at gunters.org> wrote:
>
> Do we have an easy way to know which non-ASCII characters are
> non-alphanumeric when we start talking about UTF characters from
> non-English languages? How do we distinguish "word" characters from
> punctuation when we start dealing with Russian, Chinese, Hebrew, etc? If
> there isn't a reliable function for that already (I'm not sure, myself,
> I'll leave a definitive answer up to someone with more i18n experience),
> then the url-encoding mentioned here is still the best way to handle it,
> I think.
>

For UTF-8 strings, we first decompose accented characters in the Latin-1
Supplement and Latin Extended-A blocks into their unaccented forms.  If
mb_strtolower is available, that is run next.  Then all characters outside
the ASCII range are converted to percent-encoded octets.  Then we strip any
remaining non-ASCII bits.  The call sequence is like this:

sanitize_title
sanitize_title_with_dashes
remove_accents
utf8_uri_encode
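
As a rough illustration of the steps described above, here is a minimal
Python sketch.  It is not the WordPress PHP code: the function bodies are
approximations, the accent removal uses Unicode decomposition rather than
a hand-written lookup table (so characters with no decomposition, such as
"ß", are left alone here), and the final character whitelist and dash
handling are assumptions made for the example.

```python
import re
import unicodedata
from urllib.parse import quote


def remove_accents(text):
    """Approximate accent removal for U+00C0..U+017F (assumed range)."""
    out = []
    for ch in text:
        if "\u00c0" <= ch <= "\u017f":
            # Decompose, then drop the combining marks (e.g. "é" -> "e").
            decomposed = unicodedata.normalize("NFKD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
        else:
            out.append(ch)
    return "".join(out)


def utf8_uri_encode(text):
    """Percent-encode the UTF-8 octets of every non-ASCII character."""
    return "".join(ch if ord(ch) < 128 else quote(ch, safe="")
                   for ch in text)


def sanitize_title_sketch(title):
    """Hypothetical stand-in for the call sequence listed above."""
    title = remove_accents(title)
    title = title.lower()                      # stands in for mb_strtolower
    title = utf8_uri_encode(title).lower()     # lowercase the hex digits too
    title = re.sub(r"[^%a-z0-9 _-]", "", title)  # assumed whitelist
    title = re.sub(r" +", "-", title)
    return re.sub(r"-+", "-", title).strip("-")


print(sanitize_title_sketch("¿Qué Tal?"))
```

Note that "¿" (U+00BF) sits just below the range handled by the accent
step in this sketch, so it reaches the encoding step untouched and comes
out as octets rather than being dropped.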

Non-UTF-8 strings don't get the URI encoding.  If someone switched from
ISO-8859-1 to UTF-8, that would explain why "¿" is being encoded instead of
stripped.
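
To make the charset difference concrete, here is a small Python sketch,
again only an approximation and not the WordPress code.  The function
name slug_from_bytes is hypothetical, a failed UTF-8 decode stands in for
whatever UTF-8 detection the real code does, and the character whitelist
is an assumption.

```python
import re
from urllib.parse import quote


def slug_from_bytes(raw):
    """Encode non-ASCII only when the input is valid UTF-8 (sketch)."""
    try:
        text = raw.decode("utf-8")
        is_utf8 = True
    except UnicodeDecodeError:
        # Treat anything else as a single-byte charset for this example.
        text = raw.decode("latin-1")
        is_utf8 = False
    text = text.lower()
    if is_utf8:
        # UTF-8 path: non-ASCII characters become percent-encoded octets.
        text = "".join(c if ord(c) < 128 else quote(c, safe="").lower()
                       for c in text)
    # Both paths: drop whatever is outside the assumed whitelist.
    return re.sub(r"[^%a-z0-9 _-]", "", text)


print(slug_from_bytes("¿que".encode("latin-1")))  # ISO-8859-1 input
print(slug_from_bytes("¿que".encode("utf-8")))    # UTF-8 input
```

With ISO-8859-1 input the "¿" byte never gets encoded and is stripped;
with UTF-8 input it survives as percent-encoded octets.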

Ryan