[wp-trac] [WordPress Trac] #30130: Normalize characters with combining marks to precomposed characters
WordPress Trac
noreply at wordpress.org
Tue Jan 21 08:22:41 UTC 2020
#30130: Normalize characters with combining marks to precomposed characters
------------------------------------+-----------------------------
Reporter: zodiac1978 | Owner: SergeyBiryukov
Type: enhancement | Status: reviewing
Priority: normal | Milestone: 5.4
Component: Formatting | Version:
Severity: normal | Resolution:
Keywords: dev-feedback has-patch | Focuses:
------------------------------------+-----------------------------
Comment (by zodiac1978):
Replying to [comment:47 a8bit]:
> IMO it shows that everything **should be** normalized, just not
necessarily to NFC. There is no way Apple is going to adopt NFC, NFC is
described by Unicode as for legacy systems. The future appears to be NFD.
That is not true. NFC is not described as being for legacy systems. The
linked document shows that NFD and NFC are simply two different ways of
doing it.
> > That's correct, because the filesystem itself (HFS+ and APFS for
example) are using NFD and not NFC.
>
> This means if all text in WordPress is normalized to NFC any file
comparisons with files on APFS that have multi-byte characters is going to
fail.
No, it means IMHO we have to normalize every input to NFC (as this is the
recommendation from the W3C) to have common ground. This is exactly the
reason normalization exists - to make comparisons work again in those
cases.
> I solved my problem today by writing a function to check the existence
of files using both forms, doubling the file io's in the process. Not
exactly optimal.
If everything is NFC there is no need for this anymore.
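The comparison problem can be illustrated with a short sketch (Python is used here only for illustration, via the standard unicodedata module; this is not WordPress code):

```python
import unicodedata

# "é" typed on a keyboard usually arrives precomposed (NFC): U+00E9.
nfc = "caf\u00e9"
# The same word stored by an NFD filesystem is decomposed:
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
nfd = "cafe\u0301"

# A naive codepoint comparison fails even though both render as "café".
print(nfc == nfd)                                # False

# Normalizing both sides to one form restores a working comparison.
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```

If every input is normalized to NFC on the way in, the second comparison is the only one that ever happens, and the doubled file-existence check becomes unnecessary.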
> > Windows doesn't force decomposition and I don't think you should do
this and I can't find your source on MSDN if I google this text. Can you
please share the link, so that I can check the source myself?
>
> It was quoted as a source on the wikipedia page for precomposed
characters http://msdn.microsoft.com/en-us/library/aa911606.aspx
That text is from 2010, is outdated, and applies to Windows Embedded CE.
> > Agreed, but what would be the alternative? We could check and warn the
user, as this is recommended by the document. But as the module with the
needed function is optional that wouldn't be very reliable:
>
> The alternative would be NFD.
We would have the same work to do, because the problem exists in the same
way in the other direction. Many other operating systems use NFC.
> > or we could normalize locale-specific, because the biggest problem
seems to be that other languages may have a problem with normalization:
>
> That would be great if no one ever read a website outside of their own
country
>
> > I think there are not many cases where you will really need NFD text.
The advantages of a working search, working proofreading, etc. are
outweighing any possible edge cases where the NFD text is needed.
>
> They said that about 4-digit years ;)
That's very funny, but you are not offering a solution to the problems
mentioned.
> I could mention that search and sort becomes more flexible with NFD
because you can now choose to do those things with and without the
compound characters, I don't see how proofreading is improved with NFC?
It is broken with NFD. Please see my talk and the slides, where I show all
the problems:
https://wordpress.tv/2019/08/28/torsten-landsiedel-special-characters-and-
where-to-find-them/
> > I am still recommending to get this patch in and then see what breaks
(if something breaks).
>
> I hope it all goes well, I don't have any skin in this game I was merely
flagging up one of the edge cases I actually hit today in case no one had
thought of it. Apple not allowing NFC is going to cause issues for
international macOS users when comparing source and destination data, it
remains to be seen how big of an issue that will be but I accept it's
likely to be quite small.
macOS uses NFD *internally* and knows that, so its native APIs normalize
text to NFC (as this is what comes from a keyboard in most cases). For
example, Safari normalizes text to NFC on input and uploads. If you use
only native APIs everything is fine, but if NFD gets through we have a
problem. Firefox and Chrome do NOT normalize on input or upload, and that
is what creates these problems.
We may also need to distinguish between normalizing URLs, normalizing
filenames, and normalizing content. Maybe we end up with a different
approach for filenames, but looking at
https://core.trac.wordpress.org/ticket/24661 it seems that normalizing to
NFC is the best solution here too.
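For filenames specifically, the idea would be to normalize the incoming name once, before it is stored, so NFD names delivered by HFS+/APFS or by non-normalizing browsers match the NFC form used everywhere else. A minimal sketch (Python for illustration; normalize_upload_name is an invented helper, not a WordPress function):

```python
import unicodedata

def normalize_upload_name(filename: str) -> str:
    """Normalize an incoming filename to NFC so that later comparisons,
    URLs and database lookups all agree on one canonical form."""
    return unicodedata.normalize("NFC", filename)

# An NFD name as a macOS filesystem or a non-normalizing browser
# might deliver it: "u" followed by U+0308 COMBINING DIAERESIS.
incoming = "u\u0308bung.txt"
stored = normalize_upload_name(incoming)
print(stored == "\u00fcbung.txt")  # True: the precomposed NFC form
```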
--
Ticket URL: <https://core.trac.wordpress.org/ticket/30130#comment:48>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform