[wp-trac] [WordPress Trac] #22402: Stripping non-alphanumeric multi-byte characters from slugs

WordPress Trac noreply at wordpress.org
Sat Nov 10 05:07:10 UTC 2012


#22402: Stripping non-alphanumeric multi-byte characters from slugs
-----------------------------+-------------------------
 Reporter:  johnbillion      |       Type:  enhancement
   Status:  new              |   Priority:  normal
Milestone:  Awaiting Review  |  Component:  Formatting
  Version:                   |   Severity:  normal
 Keywords:                   |
-----------------------------+-------------------------
 `sanitize_title_with_dashes()` strips non-alphanumeric characters from a
 title to create a slug. Unfortunately, it only strips ASCII
 non-alphanumeric characters. Apart from a few exceptions, all multi-byte
 characters are preserved. This means that non-alphanumeric characters from
 non-Western scripts (and plenty of Western ones) end up in the slug,
 because they're treated just like any other multi-byte character.

 As an example, here are some common non-alphanumeric Chinese characters
 which would ideally be stripped from slugs, but are not:

  * 。 (U+3002, Ideographic Full Stop, %E3%80%82)
  * ， (U+FF0C, Fullwidth Comma, %EF%BC%8C)
  * ！ (U+FF01, Fullwidth Exclamation Mark, %EF%BC%81)
  * ： (U+FF1A, Fullwidth Colon, %EF%BC%9A)
  * 《 (U+300A, Left Double Angle Bracket, %E3%80%8A)
  * 》 (U+300B, Right Double Angle Bracket, %E3%80%8B)
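
 To make the percent-encoded forms above easy to check, here is a minimal
 sketch in Python (not WordPress code) that derives each one from its code
 point by UTF-8 encoding and URL-quoting it. The `MARKS` list and the
 `describe()` helper are names invented for this illustration.

```python
from urllib.parse import quote

# Code points of the six punctuation marks listed above.
MARKS = [0x3002, 0xFF0C, 0xFF01, 0xFF1A, 0x300A, 0x300B]


def describe(code_point):
    """Return the U+XXXX label and the percent-encoded UTF-8 bytes."""
    char = chr(code_point)
    # quote() encodes as UTF-8 by default; safe="" forces every byte
    # outside the unreserved set to be percent-encoded.
    return "U+%04X" % code_point, quote(char, safe="")


for cp in MARKS:
    label, encoded = describe(cp)
    print(label, encoded)
```

 Each of these marks is three bytes long in UTF-8, which is why they pass
 through a filter that only looks at single-byte ASCII punctuation.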

 Obviously it would be impractical to make a list of ''all'' the non-ASCII
 characters we want to strip from slugs. The list would be gigantic.

 So the question is, would it be possible to use Unicode ranges to
 blacklist (or whitelist) whole ranges of characters to be stripped from
 (or preserved in) slugs? Is this practical or even desirable?

 Or would it make more sense to continue using a list of just the most
 common multi-byte characters to be stripped?

 The latter is the more practical option, but the former is a more complete
 solution.
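
 To illustrate the trade-off, here is a rough sketch of both options in
 Python rather than in WordPress's own PHP. It is not a proposed patch.
 Instead of hand-written code point ranges, the second function uses
 Unicode general categories (punctuation `P*` and symbol `S*`), which is a
 related but different technique. `COMMON_MARKS`, `strip_by_list()` and
 `strip_by_category()` are hypothetical names, and leaving ASCII characters
 untouched is an assumption made so the sketch only covers the multi-byte
 case discussed here.

```python
import unicodedata

# Hypothetical short list for the "most common characters" approach:
# the six marks from the example list in this ticket.
COMMON_MARKS = "\u3002\uff0c\uff01\uff1a\u300a\u300b"


def strip_by_list(title, marks=COMMON_MARKS):
    """Drop only the characters named in an explicit list."""
    return "".join(ch for ch in title if ch not in marks)


def strip_by_category(title):
    """Drop non-ASCII characters classed as punctuation or symbols."""
    kept = []
    for ch in title:
        # General categories starting with "P" are punctuation and
        # those starting with "S" are symbols.
        is_mark = unicodedata.category(ch)[0] in ("P", "S")
        if ord(ch) > 127 and is_mark:
            continue  # assumption: ASCII is handled elsewhere
        kept.append(ch)
    return "".join(kept)


title = "\u300a\u4f60\u597d\uff0c\u4e16\u754c\uff01\u300b"
print(strip_by_list(title))
print(strip_by_category(title))
```

 Both functions give the same result for the sample title, but a character
 missing from the list, such as 「 (U+300C, Left Corner Bracket), is only
 removed by the category-based version.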

 Thoughts?

-- 
Ticket URL: <http://core.trac.wordpress.org/ticket/22402>
WordPress Trac <http://core.trac.wordpress.org/>
WordPress blogging software

