[wp-trac] [WordPress Trac] #30130: Normalize characters with combining marks to precomposed characters

WordPress Trac noreply at wordpress.org
Mon Jan 20 22:03:03 UTC 2020


#30130: Normalize characters with combining marks to precomposed characters
------------------------------------+-----------------------------
 Reporter:  zodiac1978              |       Owner:  SergeyBiryukov
     Type:  enhancement             |      Status:  reviewing
 Priority:  normal                  |   Milestone:  5.4
Component:  Formatting              |     Version:
 Severity:  normal                  |  Resolution:
 Keywords:  dev-feedback has-patch  |     Focuses:
------------------------------------+-----------------------------

Comment (by zodiac1978):

 Replying to [comment:45 a8bit]:
 > I just wanted to throw up a contrary view of this ticket.

 Hi @a8bit and thank you for your feedback!

 > I just spent a day fighting with this problem in reverse. Renaming a
 file to a string stored in a mysql database that included a precomposed
 character (U+0161) caused the OS (macOS) to convert that character to the
 compound form (U+0073 U+030C). WordPress then couldn't find the file
 because file_exists() was always false. I had to change the string in the
 db to the compound form to get it to work.

 That shows, IMHO, exactly why everything **should be** normalized to NFC:
 then we have common ground. macOS uses NFD (decomposed characters)
 internally, which is why Safari normalizes files on upload. Chrome and
 Firefox do not. We could wait for the browsers to fix it, or we can fix it
 in WordPress.
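
 To illustrate the mismatch described above, here is a minimal sketch in
 Python (used only for illustration; it is not WordPress code, and the
 helper name to_nfc is hypothetical). It shows that the precomposed and
 decomposed forms of the same character compare as different strings until
 both are normalized to NFC:

```python
import unicodedata

precomposed = "\u0161"   # U+0161 as a single code point (NFC form)
decomposed = "s\u030c"   # U+0073 U+030C, the form macOS produced (NFD form)

# Both strings render the same glyph but are different code point sequences.
differ_before = precomposed != decomposed


def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


# Once both sides are normalized to NFC, the comparison succeeds.
same_after_nfc = to_nfc(precomposed) == to_nfc(decomposed)
```

 The same idea applies to a file name compared against a string stored in
 the database: if both are normalized to the same form first, the lookup no
 longer depends on which form the OS or browser happened to deliver.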


 > The Unicode Standard says that
 >   Many compatibility decomposable characters are included in the
 Unicode Standard solely to represent distinctions in other base standards.
 They support transmission and processing of legacy data. Their use is
 discouraged other than for legacy data or other special circumstances.
 >
 > Apple now enforces that. I could find no way to use U+016 in my file, it
 was forced to the compound form even if I entered the hex directly.

 That's correct, because the filesystems themselves (HFS+ and APFS, for
 example) use NFD and not NFC.

 > MSDN also recommends compound characters, saying that
 >
 >    Pre-composed characters may also be decomposed. For example, an
 application importing a text file containing the pre-composed character
 "ü" may decompose that character into a "u" followed by the non-spacing
 character "¨". This allows easy alphabetical sorting for languages where
 character modifiers do not affect alphabetical order. The Unicode standard
 defines decomposition for all pre-composed characters.
 >
 > I haven't checked if Windows forces the decomposition or not but
 Microsoft clearly thinks you should decompose wherever possible.

 Windows doesn't force decomposition, and I don't think you should
 decompose either. I also can't find your source on MSDN when I search for
 this text. Can you please share the link so that I can check the source
 myself?

 > I should also point out that the w3 document linked in the first post of
 this issue has been updated since 2014 and the latest version recommends
 NFC but admits it's not always appropriate or even available. (see
 https://www.w3.org/TR/charmod-norm/#normalizationChoice)

 Agreed, but what would be the alternative? We could check and warn the
 user, as the document recommends. But since the module that provides the
 needed function is optional, that wouldn't be very reliable:

 > Authoring tools SHOULD provide a means of normalizing resources and warn
 the user when a given resource is not in Unicode Normalization Form C.

 Or we could normalize in a locale-specific way, because the biggest
 problem seems to be that other languages may have trouble with
 normalization:

 > Content authors SHOULD use Unicode Normalization Form C (NFC) wherever
 possible for content. Note that NFC is not always appropriate to the
 content or even available to content authors in some languages.
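
 The "check and warn" option mentioned above could look roughly like the
 following sketch. It is written in Python for illustration only, the
 function name nfc_warning and the message text are hypothetical, and it
 assumes a normalization function is available at all, which is exactly
 the reliability concern raised above:

```python
import unicodedata
from typing import Optional


def nfc_warning(text: str) -> Optional[str]:
    """Return a warning if text is not in NFC, otherwise None.

    Hypothetical helper: it only reports the problem and leaves the
    content itself unchanged.
    """
    if unicodedata.normalize("NFC", text) == text:
        return None
    return "Content is not in Unicode Normalization Form C (NFC)."
```

 A check like this only reports the problem and leaves the decision to the
 author, which is the trade-off compared with normalizing automatically.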

 I think there are not many cases where you really need NFD text. The
 advantages of working search, working proofreading, etc. outweigh any
 edge cases where NFD text is needed.

 I still recommend getting this patch in and then seeing what breaks (if
 anything breaks).

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/30130#comment:46>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

