[wp-hackers] Plugin: Sanitize i18n (UTF-8) titles

Ryan Boren ryan at boren.nu
Wed Sep 8 00:52:08 UTC 2004


On Tue, 2004-09-07 at 17:17 -0500, Ryan Boren wrote:
> On Tue, 2004-09-07 at 01:56 -0500, Matthew Mullenweg wrote:
> > On Sep 6, 2004, at 11:02 PM, Ryan Boren wrote:
> > 
> > > Description: Escapes UTF-8 post, category, and author titles so that
> > > they are suitable for use in URIs.  ASCII characters are preserved as-
> > > is, while other characters are encoded as a sequence of octets
> > > represented in the form %HH, where HH is the hexadecimal representation
> > > of the octet.
> > 
> > Also, what would be the advantages/disadvantages of having this support 
> > in the core? It seems light and useful, and we've already modified the 
> > rules to match.
> 
> I'd like to integrate it with the remove_accents() stuff so that
> accented Latin characters are decomposed rather than escaped.  Right
> now, remove_accents() is not non-latin friendly.  We need to have a
> separate list of Unicode decompositions so that we don't perform
> utf8_decode().  I wish PHP had UCD support so that we would have
> collation and decomposition information.  Instead, we'll have to create
> our own small Unicode decomposition table for accented characters in the
> Latin-1 Supplement and Latin Extended-A blocks.  We already map a few
> letters.
> 
> $invalid_latin_chars = array(
>     chr(197).chr(146) => 'OE',
>     chr(197).chr(147) => 'oe',
>     chr(197).chr(160) => 'S',
>     chr(197).chr(189) => 'Z',
>     chr(197).chr(161) => 's',
>     chr(197).chr(190) => 'z',
>     chr(226).chr(130).chr(172) => 'E');
> 
> We'll need to flesh this out so that it's at least as complete as the
> ISO-8859-1 map we have in remove_accents().  The Latin collation chart
> will be helpful here.
> 
> http://www.unicode.org/charts/collation/chart_Latin.html
> 
> So, to map À (U+00C0 LATIN CAPITAL LETTER A WITH GRAVE) to A (U+0041
> LATIN CAPITAL LETTER A) we need to add the following to the array:
> 
> chr(195).chr(128) => chr(65)
> 
> or:
> 
> chr(hexdec('c3')).chr(hexdec('80')) => chr(hexdec('41'))
> 
> Anyone feel like adopting their favorite accented character and
> providing a mapping?  You can find the hex UTF-8 representation of the
> character at http://www.macchiato.com/unicode/charts.html.  Just combine
> the column with the row.
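
To make the table idea concrete, here is a rough sketch of how such a
map might be applied.  The table name, the helper name, and the choice
of strtr() are assumptions for illustration only, not the existing
remove_accents() code, and only a handful of entries are shown.

```php
<?php
// Hypothetical table: UTF-8 byte sequences mapped to ASCII replacements.
// A fuller table would cover the accented characters in the Latin-1
// Supplement and Latin Extended-A blocks.
$utf8_decompositions = array(
    "\xC3\x80" => 'A',  // U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
    "\xC3\xA9" => 'e',  // U+00E9 LATIN SMALL LETTER E WITH ACUTE
    "\xC3\xB1" => 'n',  // U+00F1 LATIN SMALL LETTER N WITH TILDE
    "\xC5\x92" => 'OE', // U+0152 LATIN CAPITAL LIGATURE OE
    "\xC5\xA0" => 'S',  // U+0160 LATIN CAPITAL LETTER S WITH CARON
);

// Hypothetical helper name, chosen for this sketch.
function decompose_utf8_accents($string, $table) {
    // strtr() with an array works on raw bytes and tries the longest
    // matching key first, so the string never goes through utf8_decode().
    return strtr($string, $table);
}

// Prints: A la carte, manana
echo decompose_utf8_accents("\xC3\x80 la carte, ma\xC3\xB1ana", $utf8_decompositions), "\n";
```

Characters that have no entry in the table pass through untouched, so
non-Latin text is left alone by this step.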

Hmmm, if we incorporate this into the core, what should the WordPress
default be?  Escape accented Latin characters, or remove the accents?
If we remove the accents, we lose information.  Most browsers display
the escaped characters properly when you mouse over a link, so you are
insulated a bit from the %HH ugliness.
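
For comparison, the escaping option might look roughly like this.  This
is a minimal sketch of the %HH encoding described in the plugin
description, not the plugin's actual code; the function name is made up
for the example.

```php
<?php
// Hypothetical helper: percent-encode every non-ASCII octet as %HH and
// leave ASCII octets as they are.
function escape_non_ascii_octets($string) {
    $out = '';
    $length = strlen($string);
    for ($i = 0; $i < $length; $i++) {
        $octet = ord($string[$i]);
        if ($octet > 127) {
            // Part of a multi-byte UTF-8 sequence: emit %HH in lowercase hex.
            $out .= sprintf('%%%02x', $octet);
        } else {
            $out .= $string[$i];
        }
    }
    return $out;
}

// Prints: i%c3%b1t%c3%abrn%c3%a2ti%c3%b4n%c3%a0liz%c3%a6ti%c3%b8n
echo escape_non_ascii_octets("i\xC3\xB1t\xC3\xABrn\xC3\xA2ti\xC3\xB4n\xC3\xA0liz\xC3\xA6ti\xC3\xB8n"), "\n";
```

Nothing is lost this way: the original title can be recovered from the
escaped form with rawurldecode().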

Matt has a similar discussion going on at his site.

http://photomatt.net/2004/09/07/i%c3%b1t%c3%abrn%c3%a2ti%c3%b4n%c3%a0liz%c3%a6ti%c3%b8n/

Ryan





