[wp-trac] [WordPress Trac] #24661: remove_accents is not removing combining accents
WordPress Trac
noreply at wordpress.org
Sun Sep 18 13:03:24 UTC 2016
#24661: remove_accents is not removing combining accents
------------------------------------+--------------------
Reporter: NumidWasNotAvailable | Owner:
Type: defect (bug) | Status: new
Priority: normal | Milestone: 4.7
Component: Formatting | Version: 1.2.1
Severity: normal | Resolution:
Keywords: has-patch dev-feedback | Focuses:
------------------------------------+--------------------
Comment (by gitlost):
A `_wp_can_use_pcre_u`-like function would be handy, if only for the unit
tests.
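As a rough sketch of what such a helper might look like, assuming the thing to detect is Unicode property (`\p{..}`) support on top of `/u`: the function name `example_can_use_pcre_ucp` and the static caching are made up for illustration and are not anything in core.

```php
<?php
/**
 * Hypothetical helper (name and behaviour are assumptions, not core API):
 * reports whether preg_* can compile and match a \p{..} class under /u.
 */
function example_can_use_pcre_ucp() {
	static $supported = null;

	if ( null === $supported ) {
		// "\xcc\x81" is U+0301 COMBINING ACUTE ACCENT, which is in \p{Mn}.
		// preg_match() returns false (with a warning, silenced by '@') if the
		// pattern cannot be compiled, so only an actual match counts as support.
		$supported = ( 1 === @preg_match( '/^\p{Mn}$/u', "\xcc\x81" ) );
	}

	return $supported;
}
```

A unit test could then skip the `\p{Mn}` assertions when this returns false.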
A detail about the ICU Latin-ASCII transform is that the first thing it does
is globally filter the input with `:: [[:Latin:][:Common:][:Inherited:][〇]]
;`, so the later `\p{Mn}` actually only matches the combining marks
that make it through the filter, all of which are in `\p{Inherited}`. So
the same intersection could be done here, which reduces the single-byte
regex alternation to 554 code points, making it a lot less scary:
`define( 'WP_MN_INHERITED_REGEX_ALTS',
'\xcc[\x80-\xbf]|\xcd[\x80-\xaf]|\xd2[\x85\x86]|\xd9[\x8b-\x95\xb0]|\xe0\xa5[\x91\x92]|\xe1(?:\xaa[\xb0-\xbd]|\xb3[\x90-\x92\x94-\xa0\xa2-\xa8\xad\xb4\xb8\xb9]|\xb7[\x80-\xb5\xbb-\xbf])|\xe2\x83[\x90-\x9c\xa1\xa5-\xb0]|\xe3(?:\x80[\xaa-\xad]|\x82[\x99\x9a])|\xef\xb8[\x80-\x8f\xa0-\xad]|\xf0(?:\x90(?:\x87\xbd|\x8b\xa0)|\x9d(?:\x85[\xa7-\xa9\xbb-\xbf]|\x86[\x80-\x82\x85-\x8b\xaa-\xad]))|\xf3\xa0(?:[\x84-\x86][\x80-\xbf]|\x87[\x80-\xaf])'
); // 554 code points.`
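To illustrate how a byte-wise alternation like this would be applied without `/u`, here is a minimal sketch. It deliberately uses only the first two branches of the list above (U+0300 to U+036F), and the names `EXAMPLE_MN_SUBSET_ALTS` and `example_strip_marks_bytewise` are hypothetical. It assumes the input is valid UTF-8.

```php
<?php
// Subset of the alternation above, for illustration only:
// \xcc[\x80-\xbf] covers U+0300-U+033F, \xcd[\x80-\xaf] covers U+0340-U+036F.
define( 'EXAMPLE_MN_SUBSET_ALTS', '\xcc[\x80-\xbf]|\xcd[\x80-\xaf]' );

/**
 * Hypothetical helper: strips runs of the combining marks listed above.
 * No /u modifier, so PCRE matches raw bytes and needs no UTF-8 or UCP support.
 */
function example_strip_marks_bytewise( $string ) {
	return preg_replace( '/(?:' . EXAMPLE_MN_SUBSET_ALTS . ')+/', '', $string );
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT.
echo example_strip_marks_bytewise( "cafe\xcc\x81" ); // prints "cafe"
```

Because `\xcc` and `\xcd` only ever appear as lead bytes in valid UTF-8, the byte pattern cannot match in the middle of another character. Precomposed letters such as `é` (`\xc3\xa9`) are left for the existing `remove_accents` mapping.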
It complicates the UCP version though, e.g.
`/(?<=\p{Latin})(?:(?=\p{Inherited})\p{Mn})+/u`
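For comparison, a small sketch of how that UCP pattern behaves, assuming PCRE was built with UTF-8 and Unicode property support. The function name `example_strip_marks_ucp` is hypothetical.

```php
<?php
/**
 * Hypothetical helper: removes runs of Inherited-script nonspacing marks,
 * but only where the run directly follows a Latin-script character.
 */
function example_strip_marks_ucp( $string ) {
	return preg_replace( '/(?<=\p{Latin})(?:(?=\p{Inherited})\p{Mn})+/u', '', $string );
}

// Latin "e" + U+0301: the mark is removed.
echo example_strip_marks_ucp( "e\xcc\x81" ), "\n"; // prints "e"

// Greek alpha (U+03B1) + U+0301: the lookbehind fails, so nothing is removed.
var_dump( "\xce\xb1\xcc\x81" === example_strip_marks_ucp( "\xce\xb1\xcc\x81" ) ); // bool(true)
```

The lookbehind only inspects the character immediately before the run, so a whole run of marks after a Latin letter goes in one match, while the same marks after a non-Latin base are kept.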
--
Ticket URL: <https://core.trac.wordpress.org/ticket/24661#comment:24>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform