[wp-trac] [WordPress Trac] #22402: Stripping non-alphanumeric multi-byte characters from slugs
WordPress Trac
noreply at wordpress.org
Sat Nov 10 05:07:10 UTC 2012
#22402: Stripping non-alphanumeric multi-byte characters from slugs
-----------------------------+-------------------------
Reporter:   johnbillion      | Type:      enhancement
Status:     new              | Priority:  normal
Milestone:  Awaiting Review  | Component: Formatting
Version:                     | Severity:  normal
Keywords:                    |
-----------------------------+-------------------------
`sanitize_title_with_dashes()` strips non-alphanumeric characters from a
title to create a slug. Unfortunately, it strips only ASCII
non-alphanumeric characters. Apart from a few exceptions, all multi-byte
characters are preserved. This means that all non-Western (and plenty of
Western) non-alphanumeric characters end up in the slug, because they're
treated just like any other multi-byte character.
As an example, here are some common non-alphanumeric Chinese characters
which would ideally be stripped from slugs, but are not:
* 。 (U+3002, Ideographic Full Stop, %E3%80%82)
* , (U+FF0C, Fullwidth Comma, %EF%BC%8C)
* ! (U+FF01, Fullwidth Exclamation Mark, %EF%BC%81)
* : (U+FF1A, Fullwidth Colon, %EF%BC%9A)
* 《 (U+300A, Left Double Angle Bracket, %E3%80%8A)
* 》 (U+300B, Right Double Angle Bracket, %E3%80%8B)
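As a quick cross-check of the list above, the following is a minimal Python sketch (an illustration only, not WordPress code) that derives each character's code point, UTF-8 percent-encoding, and Unicode general category. The helper name `describe` is hypothetical. The characters are written as escapes so the snippet does not depend on how the message is encoded.

```python
import unicodedata
from urllib.parse import quote

# The six characters from the list above, written as escapes.
CHARS = ["\u3002", "\uff0c", "\uff01", "\uff1a", "\u300a", "\u300b"]


def describe(ch):
    """Return (code point, UTF-8 percent-encoding, Unicode general category).

    `describe` is a hypothetical helper used only for this illustration.
    """
    return (f"U+{ord(ch):04X}", quote(ch), unicodedata.category(ch))


if __name__ == "__main__":
    for ch in CHARS:
        # str.isalnum() is False for all six, yet each is three bytes in UTF-8.
        print(ch, *describe(ch), ch.isalnum(), len(ch.encode("utf-8")))
```

For U+3002 this gives `('U+3002', '%E3%80%82', 'Po')`. All six characters fall in a punctuation category (`Po`, `Ps` or `Pe`), so category information is one way to tell them apart from letters.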
Obviously it would be impractical to make a list of ''all'' the non-ASCII
characters we want to strip from slugs. The list would be gigantic.
So the question is, would it be possible to use Unicode ranges to
blacklist (or whitelist) whole ranges of characters to be stripped from
(or preserved in) slugs? Is this practical or even desirable?
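To make the range-based idea concrete, here is a rough Python sketch (not a proposed patch, and not how `sanitize_title_with_dashes()` works today). It whitelists by Unicode general category instead of hard-coded ranges: letters, marks and numbers are kept, and everything else is treated as a separator. The function name `slugify_by_category`, the choice of categories, and the decision to turn stripped characters into hyphens are all assumptions made for illustration.

```python
import re
import unicodedata

# Assumption: keep Letters (L*), Marks (M*) and Numbers (N*); everything else
# (punctuation, symbols, separators, control characters) is treated as a break.
KEEP_MAJOR_CATEGORIES = {"L", "M", "N"}


def slugify_by_category(title):
    """Hypothetical slug builder that whitelists by Unicode general category."""
    pieces = []
    for ch in title:
        if unicodedata.category(ch)[0] in KEEP_MAJOR_CATEGORIES:
            pieces.append(ch.lower())
        else:
            pieces.append("-")
    # Collapse runs of hyphens and trim them from both ends.
    return re.sub(r"-+", "-", "".join(pieces)).strip("-")


if __name__ == "__main__":
    print(slugify_by_category("Hello, World!"))
    # Fullwidth and CJK punctuation written as escapes.
    print(slugify_by_category("\u300a\u4f60\u597d\u300b\uff0c\u4e16\u754c\uff01"))
```

With this sketch, `Hello, World!` becomes `hello-world`, and a title wrapped in U+300A/U+300B with a fullwidth comma and exclamation mark becomes `你好-世界`. Whether such a broad whitelist is desirable for every language is exactly the open question.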
Or would it make more sense to continue using a list of just the most
common multi-byte characters to be stripped?
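For comparison, the list-based alternative could look roughly like the Python sketch below. The names `STRIP_CHARS` and `strip_listed` are hypothetical, and the list holds only the six characters from this ticket, not any list that WordPress actually ships. Here the listed characters are simply deleted.

```python
# Assumption: an explicit list of characters to strip, seeded with the six
# examples from this ticket (written as escapes).
STRIP_CHARS = "\u3002\uff0c\uff01\uff1a\u300a\u300b"

# str.maketrans with a third argument builds a table that deletes those characters.
STRIP_TABLE = str.maketrans("", "", STRIP_CHARS)


def strip_listed(title):
    """Hypothetical helper: delete only the explicitly listed characters."""
    return title.translate(STRIP_TABLE)


if __name__ == "__main__":
    print(strip_listed("\u300a\u4f60\u597d\u300b"))  # brackets removed
    print(strip_listed("\u4f60\u597d\u3001"))        # U+3001 is not listed, so it stays
```

The second call shows the trade-off: U+3001 (Ideographic Comma) is not in the list, so it survives until someone adds it.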
The latter seems far more practical, but the former is a more complete
solution.
Thoughts?
--
Ticket URL: <http://core.trac.wordpress.org/ticket/22402>
WordPress Trac <http://core.trac.wordpress.org/>
WordPress blogging software