[wp-trac] [WordPress Trac] #30130: Normalize characters with combining marks to precomposed characters
WordPress Trac
noreply at wordpress.org
Tue Jan 21 08:22:41 UTC 2020
#30130: Normalize characters with combining marks to precomposed characters
------------------------------------+-----------------------------
Reporter: zodiac1978 | Owner: SergeyBiryukov
Type: enhancement | Status: reviewing
Priority: normal | Milestone: 5.4
Component: Formatting | Version:
Severity: normal | Resolution:
Keywords: dev-feedback has-patch | Focuses:
------------------------------------+-----------------------------
Comment (by zodiac1978):
Replying to [comment:47 a8bit]:
> IMO it shows that everything **should be** normalized, just not
necessarily to NFC. There is no way Apple is going to adopt NFC, NFC is
described by Unicode as for legacy systems. The future appears to be NFD.
That is not true. NFC is not described as being for legacy systems. The
linked document shows that NFD and NFC are simply two different ways of
doing it.
> > That's correct, because the filesystem itself (HFS+ and APFS for
example) are using NFD and not NFC.
>
> This means if all text in WordPress is normalized to NFC any file
comparisons with files on APFS that have multi-byte characters is going to
fail.
No, it means IMHO we have to normalize every input to NFC (as this is the
recommendation from the W3C) to have common ground. This is exactly the
reason normalization exists - to make comparisons work again in those
cases.
> I solved my problem today by writing a function to check the existence
of files using both forms, doubling the file io's in the process. Not
exactly optimal.
If everything is NFC there is no need for this anymore.
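The comparison problem can be illustrated with a short sketch (Python is used here only for illustration, via the standard unicodedata module; this is not WordPress code):

```python
import unicodedata

# "é" typed on a keyboard usually arrives precomposed (NFC): U+00E9.
nfc = "caf\u00e9"
# The same word stored by an NFD filesystem is decomposed:
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
nfd = "cafe\u0301"

# A naive codepoint comparison fails even though both render as "café".
print(nfc == nfd)                                # False

# Normalizing both sides to one form restores a working comparison.
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```

If every input is normalized to NFC on the way in, the second comparison is the only one that ever happens, and the doubled file-existence check becomes unnecessary.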
> > Windows doesn't force decomposition and I don't think you should do
this and I can't find your source on MSDN if I google this text. Can you
please share the link, so that I can check the source myself?
>
> It was quoted as a source on the wikipedia page for precomposed
characters http://msdn.microsoft.com/en-us/library/aa911606.aspx
That text is from 2010, is outdated, and applies to Windows Embedded CE.
> > Agreed, but what would be the alternative? We could check and warn the
user, as this is recommended by the document. But as the module with the
needed function is optional that wouldn't be very reliable:
>
> The alternative would be NFD.
We would have the same work to do, because the problem exists in the same
way in the other direction. Many other operating systems use NFC.
> > or we could normalize locale-specific, because the biggest problem
seems to be that other languages may have a problem with normalization:
>
> That would be great if no one ever read a website outside of their own
country
>
> > I think there are not many cases where you will really need NFD text.
The advantages of a working search, working proofreading, etc. are
outweighing any possible edge cases where the NFD text is needed.
>
> They said that about 4-digit years ;)
That's very funny, but you are not offering a solution to the problems
mentioned.
> I could mention that search and sort becomes more flexible with NFD
because you can now choose to do those things with and without the
compound characters, I don't see how proofreading is improved with NFC?
It is broken with NFD. Please see my talk and the slides, where I show all
the problems:
https://wordpress.tv/2019/08/28/torsten-landsiedel-special-characters-and-
where-to-find-them/
> > I am still recommending to get this patch in and then see what breaks
(if something breaks).
>
> I hope it all goes well, I don't have any skin in this game I was merely
flagging up one of the edge cases I actually hit today in case no one had
thought of it. Apple not allowing NFC is going to cause issues for
international macOS users when comparing source and destination data, it
remains to be seen how big of an issue that will be but I accept it's
likely to be quite small.
macOS uses NFD *internally* and knows that, so its native APIs normalize
text to NFC (as this is what comes from a keyboard in most cases). For
example, Safari normalizes text to NFC on input and uploads. If you use
only native APIs everything is fine, but if NFD gets through we have a
problem. Firefox and Chrome do NOT normalize on input or upload, and that
is what creates these problems.
We may also need to distinguish between normalizing URLs, normalizing
filenames, and normalizing content. Maybe we end up with a different
approach for filenames, but looking at
https://core.trac.wordpress.org/ticket/24661 it seems that normalizing to
NFC is the best solution here too.
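For filenames specifically, the idea would be to normalize the incoming name once, before it is stored, so NFD names delivered by HFS+/APFS or by non-normalizing browsers match the NFC form used everywhere else. A minimal sketch (Python for illustration; normalize_upload_name is an invented helper, not a WordPress function):

```python
import unicodedata

def normalize_upload_name(filename: str) -> str:
    """Normalize an incoming filename to NFC so that later comparisons,
    URLs and database lookups all agree on one canonical form."""
    return unicodedata.normalize("NFC", filename)

# An NFD name as a macOS filesystem or a non-normalizing browser
# might deliver it: "u" followed by U+0308 COMBINING DIAERESIS.
incoming = "u\u0308bung.txt"
stored = normalize_upload_name(incoming)
print(stored == "\u00fcbung.txt")  # True: the precomposed NFC form
```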
--
Ticket URL: <https://core.trac.wordpress.org/ticket/30130#comment:48>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform