[wp-trac] [WordPress Trac] #24661: remove_accents is not removing combining accents

WordPress Trac noreply at wordpress.org
Sun Sep 18 13:03:24 UTC 2016


#24661: remove_accents is not removing combining accents
------------------------------------+--------------------
 Reporter:  NumidWasNotAvailable    |       Owner:
     Type:  defect (bug)            |      Status:  new
 Priority:  normal                  |   Milestone:  4.7
Component:  Formatting              |     Version:  1.2.1
 Severity:  normal                  |  Resolution:
 Keywords:  has-patch dev-feedback  |     Focuses:
------------------------------------+--------------------

Comment (by gitlost):

 A helper along the lines of `_wp_can_use_pcre_u` would be handy, for the
 unit tests if nothing else.
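
 A minimal sketch of what such a helper could look like, assuming it probes
 PCRE by running a small test pattern. The name `_wp_can_use_pcre_ucp` and
 the probe pattern are assumptions for illustration, not part of any patch
 on this ticket.

```php
<?php
/**
 * Hypothetical helper: reports whether PCRE was built with Unicode
 * property support, i.e. whether `\p{Mn}` works under the `u` modifier.
 */
function _wp_can_use_pcre_ucp() {
	static $can_use = null;

	if ( null === $can_use ) {
		// "\xcc\x81" is U+0301 COMBINING ACUTE ACCENT encoded as UTF-8.
		// preg_match() returns 1 on a match, and false (with a warning,
		// suppressed here) if the pattern cannot be compiled.
		$can_use = ( 1 === @preg_match( '/^\p{Mn}$/u', "\xcc\x81" ) );
	}

	return $can_use;
}
```

 A unit test could then call PHPUnit's `markTestSkipped()` when the helper
 returns false, rather than failing on builds without property support.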

 One detail about ICU's Latin-ASCII transform is that the first thing it does
 is globally filter the input with `:: [[:Latin:][:Common:][:Inherited:][〇]]
 ;`, so the later `\p{Mn}` actually only matches the combining marks
 that make it through the filter, and those only occur in `\p{Inherited}`. The
 same intersection could be done here, which reduces the regex alternation
 used in single-byte (non-`/u`) mode to 554 code points, making it a lot less
 scary:

 `define( 'WP_MN_INHERITED_REGEX_ALTS',
 '\xcc[\x80-\xbf]|\xcd[\x80-\xaf]|\xd2[\x85\x86]|\xd9[\x8b-\x95\xb0]|\xe0\xa5[\x91\x92]|\xe1(?:\xaa[\xb0-\xbd]|\xb3[\x90-\x92\x94-\xa0\xa2-\xa8\xad\xb4\xb8\xb9]|\xb7[\x80-\xb5\xbb-\xbf])|\xe2\x83[\x90-\x9c\xa1\xa5-\xb0]|\xe3(?:\x80[\xaa-\xad]|\x82[\x99\x9a])|\xef\xb8[\x80-\x8f\xa0-\xad]|\xf0(?:\x90(?:\x87\xbd|\x8b\xa0)|\x9d(?:\x85[\xa7-\xa9\xbb-\xbf]|\x86[\x80-\x82\x85-\x8b\xaa-\xad]))|\xf3\xa0(?:[\x84-\x86][\x80-\xbf]|\x87[\x80-\xaf])'
 ); // 554 code points.`
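
 For illustration, a sketch of how such an alternation might be applied
 without the `u` modifier. Only the first two alternatives of the constant
 (covering U+0300 to U+036F) are used here, and the names
 `DEMO_MN_REGEX_ALTS` and `strip_marks_bytewise` are hypothetical.

```php
<?php
// Subset of the alternation above: the UTF-8 byte sequences for
// U+0300-U+033F (\xcc\x80-\xcc\xbf) and U+0340-U+036F (\xcd\x80-\xcd\xaf).
define( 'DEMO_MN_REGEX_ALTS', '\xcc[\x80-\xbf]|\xcd[\x80-\xaf]' );

/**
 * Hypothetical helper: removes the combining marks covered by the
 * alternation, matching raw bytes (no `u` modifier).
 */
function strip_marks_bytewise( $string ) {
	return preg_replace( '/(?:' . DEMO_MN_REGEX_ALTS . ')+/', '', $string );
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT (bytes 0xCC 0x81).
var_dump( strip_marks_bytewise( "e\xcc\x81" ) === 'e' ); // bool(true)
```

 Matching on bytes is safe for valid UTF-8 input because lead bytes such as
 0xCC never occur as continuation bytes, so the alternation cannot match in
 the middle of another character.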

 It complicates the UCP usage though, e.g.
 `/(?<=\p{Latin})(?:(?=\p{Inherited})\p{Mn})+/u`
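
 A sketch of how that pattern behaves, assuming a PCRE build with Unicode
 property support. The sample strings are made up for the demonstration.

```php
<?php
// Pattern from the comment above: strip runs of combining marks that are
// both \p{Mn} and \p{Inherited}, but only directly after a Latin letter.
$pattern = '/(?<=\p{Latin})(?:(?=\p{Inherited})\p{Mn})+/u';

$latin = "e\xcc\x81";         // "e" + U+0301 COMBINING ACUTE ACCENT.
$greek = "\xce\xb1\xcc\x81";  // Greek alpha (U+03B1) + U+0301.

// Guard: preg_match() returns false if \p{...} is unsupported.
if ( false !== @preg_match( $pattern, '' ) ) {
	// The mark after the Latin letter is removed.
	var_dump( 'e' === preg_replace( $pattern, '', $latin ) );
	// The lookbehind fails after a Greek letter, so the mark stays.
	var_dump( $greek === preg_replace( $pattern, '', $greek ) );
}
```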

--
Ticket URL: <https://core.trac.wordpress.org/ticket/24661#comment:24>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

