[wp-trac] [WordPress Trac] #30130: Normalize characters with combining marks to precomposed characters

WordPress Trac noreply at wordpress.org
Mon Jan 20 18:25:41 UTC 2020


#30130: Normalize characters with combining marks to precomposed characters
------------------------------------+-----------------------------
 Reporter:  zodiac1978              |       Owner:  SergeyBiryukov
     Type:  enhancement             |      Status:  reviewing
 Priority:  normal                  |   Milestone:  5.4
Component:  Formatting              |     Version:
 Severity:  normal                  |  Resolution:
 Keywords:  dev-feedback has-patch  |     Focuses:
------------------------------------+-----------------------------

Comment (by a8bit):

 I just wanted to offer a contrary view on this ticket.

 I just spent a day fighting with this problem in reverse. Renaming a file
 to a string stored in a MySQL database that included a precomposed
 character (U+0161) caused the OS (macOS) to convert that character to the
 compound (decomposed) form, U+0073 U+030C. WordPress then couldn't find
 the file because file_exists() always returned false. I had to change the
 string in the db to the compound form to get it to work.
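
 The mismatch described above can be reproduced outside WordPress. The
 following is a minimal Python sketch, not the WordPress code path: it
 shows that the two spellings of the same character compare unequal as raw
 strings and match once both sides are normalized to one form. The helper
 names same_name() and find_existing() are hypothetical, introduced only
 for illustration.

```python
import os
import unicodedata

PRECOMPOSED = "\u0161"   # LATIN SMALL LETTER S WITH CARON, one code point
DECOMPOSED = "s\u030c"   # U+0073 followed by U+030C COMBINING CARON


def same_name(a, b, form="NFC"):
    """Compare two names after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def find_existing(directory, wanted):
    """Hypothetical helper: return the path of the directory entry that
    matches `wanted` up to Unicode normalization, or None if none does."""
    for entry in os.listdir(directory):
        if same_name(entry, wanted):
            return os.path.join(directory, entry)
    return None


# The raw strings differ, which is why a naive existence check can fail
# when the stored name and the on-disk name use different forms.
print(PRECOMPOSED == DECOMPOSED)           # False
print(same_name(PRECOMPOSED, DECOMPOSED))  # True
```

 Comparing normalized copies leaves both the database value and the file
 name untouched, so it works whichever form the file system settles on.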

 The Unicode Standard says that

    Many compatibility decomposable characters are included in the Unicode
 Standard solely to represent distinctions in other base standards. They
 support transmission and processing of legacy data. Their use is
 discouraged other than for legacy data or other special circumstances.

 Apple now enforces that. I could find no way to use U+0161 in my file; it
 was forced to the compound form even if I entered the hex directly.

 MSDN also recommends compound characters, saying that

    Pre-composed characters may also be decomposed. For example, an
 application importing a text file containing the pre-composed character
 "ü" may decompose that character into a "u" followed by the non-spacing
 character "¨". This allows easy alphabetical sorting for languages where
 character modifiers do not affect alphabetical order. The Unicode standard
 defines decomposition for all pre-composed characters.

 I haven't checked whether Windows forces the decomposition, but Microsoft
 clearly thinks you should decompose wherever possible.
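
 The decomposition and sorting behaviour described in the MSDN quote can
 be sketched in Python as follows. This is an illustration under stated
 assumptions, not Microsoft's implementation: it uses canonical
 decomposition (NFD), and the function names decompose() and sort_key()
 are hypothetical.

```python
import unicodedata


def decompose(text):
    """Canonically decompose text (NFD): "ü" becomes "u" + U+0308."""
    return unicodedata.normalize("NFD", text)


def sort_key(text):
    """Sort key that ignores combining marks, so "ü" sorts alongside "u"."""
    return "".join(
        ch for ch in decompose(text) if not unicodedata.combining(ch)
    )


words = ["zebra", "über", "apple", "ufo"]

# Plain code point order puts the precomposed "ü" (U+00FC) after "z".
print(sorted(words))                # ['apple', 'ufo', 'zebra', 'über']

# With marks stripped from the key, "über" sorts as if spelled "uber".
print(sorted(words, key=sort_key))  # ['apple', 'über', 'ufo', 'zebra']
```

 Only the sort key is stripped of marks; the words themselves are returned
 unchanged. Ignoring marks suits only languages where the modifier does not
 affect alphabetical order, which is the case the quote describes.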

 I should also point out that the W3C document linked in the first post of
 this issue has been updated since 2014. The latest version recommends NFC
 but admits it's not always appropriate or even available (see
 https://www.w3.org/TR/charmod-norm/#normalizationChoice).

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/30130#comment:45>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

