[wp-hackers] Plugin: Sanitize i18n (UTF-8) titles

Ryan Boren ryan at boren.nu
Tue Sep 7 22:17:34 UTC 2004


On Tue, 2004-09-07 at 01:56 -0500, Matthew Mullenweg wrote:
> On Sep 6, 2004, at 11:02 PM, Ryan Boren wrote:
> 
> > Description: Escapes UTF-8 post, category, and author titles so that
> > they are suitable for use in URIs.  ASCII characters are preserved as-
> > is, while other characters are encoded as a sequence of octets
> > represented in the form %HH, where HH is the hexadecimal representation
> > of the octet.
> 
> Also, what would be the advantages/disadvantages of having this support 
> in the core? It seems light and useful, and we've already modified the 
> rules to match.

I'd like to integrate it with the remove_accents() stuff so that
accented Latin characters are decomposed rather than escaped.  Right
now, remove_accents() is not friendly to non-Latin text.  We need to have a
separate list of Unicode decompositions so that we don't perform
utf8_decode().  I wish PHP had UCD support so that we would have
collation and decomposition information.  Instead, we'll have to create
our own small Unicode decomposition table for accented characters in the
Latin-1 Supplement and Latin Extended-A blocks.  We already map a few
letters.

$invalid_latin_chars = array(
    chr(197).chr(146) => 'OE', chr(197).chr(147) => 'oe',
    chr(197).chr(160) => 'S',  chr(197).chr(189) => 'Z',
    chr(197).chr(161) => 's',  chr(197).chr(190) => 'z',
    chr(226).chr(130).chr(172) => 'E');

We'll need to flesh this out so that it's at least as complete as the
ISO-8859-1 map we have in remove_accents().  The Latin collation chart
will be helpful here.

http://www.unicode.org/charts/collation/chart_Latin.html

So, to map À (U+00C0 LATIN CAPITAL LETTER A WITH GRAVE) to A (U+0041
LATIN CAPITAL LETTER A) we need to add the following to the array:

chr(195).chr(128) => chr(65)

or:

chr(hexdec('c3')).chr(hexdec('80')) => chr(hexdec('41'))
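
To make "decomposed rather than escaped" concrete, here is a rough sketch of
how such a table might be applied before the %HH escaping step.  The table
name, the function name, and the second table entry are assumptions made up
for illustration; only the À mapping comes from the text above, and this is
not the plugin's actual code.

```php
<?php
// Hypothetical table: only the first entry is taken from the message above.
$sketch_latin_decompositions = array(
    chr(195).chr(128) => chr(65), // U+00C0 À => A
    chr(195).chr(169) => 'e',     // U+00E9 é => e (assumed extra entry)
);

// Hypothetical helper: decompose known accented Latin characters, then
// escape whatever non-ASCII octets are left as %HH.
function sketch_sanitize_title($title, $table) {
    // strtr() with an array matches the longest keys first and does not
    // rescan text it has already replaced.
    $title = strtr($title, $table);

    $out = '';
    for ($i = 0; $i < strlen($title); $i++) {
        $octet = ord($title[$i]);
        // ASCII is preserved as-is; any other octet becomes %HH.
        $out .= ($octet > 127) ? sprintf('%%%02X', $octet) : $title[$i];
    }
    return $out;
}
```

With this table, À comes out as a plain A, while a character the table does
not know about is still escaped octet by octet.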

Anyone feel like adopting their favorite accented character and
providing a mapping?  You can find the hex UTF-8 representation of the
character at http://www.macchiato.com/unicode/charts.html.  Just combine
the column with the row.
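
If you would rather compute the octets than read them off the chart, a small
helper along these lines can print the chr() key for a code point.  Both
function names are hypothetical, and the sketch only handles code points up to
U+FFFF (one to three octets) without validating its input, which is enough
for the Latin-1 Supplement and Latin Extended-A blocks.

```php
<?php
// Hypothetical helper: return the UTF-8 octets for a code point <= U+FFFF.
function sketch_utf8_octets($code_point) {
    if ($code_point < 0x80) {
        return array($code_point);                 // one octet (ASCII)
    }
    if ($code_point < 0x800) {
        return array(                              // two octets
            0xC0 | ($code_point >> 6),
            0x80 | ($code_point & 0x3F),
        );
    }
    return array(                                  // three octets
        0xE0 | ($code_point >> 12),
        0x80 | (($code_point >> 6) & 0x3F),
        0x80 | ($code_point & 0x3F),
    );
}

// Hypothetical helper: format the octets as a chr() expression for the table.
function sketch_chr_expression($code_point) {
    $parts = array();
    foreach (sketch_utf8_octets($code_point) as $octet) {
        $parts[] = 'chr(' . $octet . ')';
    }
    return implode('.', $parts);
}

echo sketch_chr_expression(0x00C0), "\n"; // chr(195).chr(128)
```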

Ryan



