[wp-trac] [WordPress Trac] #30130: Normalize characters with combining marks to precomposed characters
WordPress Trac
noreply at wordpress.org
Mon Jan 20 22:03:03 UTC 2020
#30130: Normalize characters with combining marks to precomposed characters
------------------------------------+-----------------------------
Reporter: zodiac1978 | Owner: SergeyBiryukov
Type: enhancement | Status: reviewing
Priority: normal | Milestone: 5.4
Component: Formatting | Version:
Severity: normal | Resolution:
Keywords: dev-feedback has-patch | Focuses:
------------------------------------+-----------------------------
Comment (by zodiac1978):
Replying to [comment:45 a8bit]:
> I just wanted to throw up a contrary view of this ticket.
Hi @a8bit and thank you for your feedback!
> I just spent a day fighting with this problem in reverse. Renaming a
file to a string stored in a mysql database that included a precomposed
character (U+0161) caused the OS (macOS) to convert that character to the
compound form (U+0073 U+030C). WordPress then couldn't find the file
because file_exists() was always false. I had to change the string in the
db to the compound form to get it to work.
That shows, IMHO, exactly why everything **should be** normalized to NFC:
then we have common ground. macOS uses NFD (decomposed characters)
internally, which is why Safari normalizes file names on upload, but
Chrome and Firefox do not. We could wait for the browsers to fix it, or
we can fix it in WordPress.
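To illustrate the mismatch described above, here is a minimal sketch in
Python (not WordPress code; the file name is made up for the example)
using the standard `unicodedata` module. It shows that the precomposed and
decomposed spellings differ as raw strings and match once both sides are
normalized to NFC.

```python
import unicodedata

# Hypothetical file name, spelled two ways.
precomposed = "\u0161koda.jpg"   # U+0161, as stored in the database
decomposed = "s\u030Ckoda.jpg"   # U+0073 U+030C, as returned by macOS

# The raw strings differ, so an exact-match lookup fails.
raw_equal = precomposed == decomposed

# After normalizing both sides to NFC they compare equal.
nfc_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(raw_equal, nfc_equal)  # prints: False True
```

The same comparison would work with NFD on both sides; the point is only
that both strings have to be in the same form before they are compared.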
> The Unicode Standard says that
> Many compatibility decomposable characters are included in the
Unicode Standard solely to represent distinctions in other base standards.
They support transmission and processing of legacy data. Their use is
discouraged other than for legacy data or other special circumstances.
>
> Apple now enforces that. I could find no way to use U+0161 in my file, it
was forced to the compound form even if I entered the hex directly.
That's correct, because the filesystems themselves (HFS+ and APFS, for
example) use NFD and not NFC.
> MSDN also recommends compound characters, saying that
>
> Pre-composed characters may also be decomposed. For example, an
application importing a text file containing the pre-composed character
"ü" may decompose that character into a "u" followed by the non-spacing
character "¨". This allows easy alphabetical sorting for languages where
character modifiers do not affect alphabetical order. The Unicode standard
defines decomposition for all pre-composed characters.
>
> I haven't checked if Windows forces the decomposition or not but
Microsoft clearly thinks you should decompose wherever possible.
Windows doesn't force decomposition, and I don't think you should
decompose either. I also can't find your source on MSDN when I google
this text. Can you please share the link so that I can check the source
myself?
> I should also point out that the w3 document linked in the first post of
this issue has been updated since 2014 and the latest version recommends
NFC but admits it's not always appropriate or even available. (see
https://www.w3.org/TR/charmod-norm/#normalizationChoice)
Agreed, but what would be the alternative? We could check and warn the
user, as the document recommends. But because the module that provides the
needed function is optional, that wouldn't be very reliable:
> Authoring tools SHOULD provide a means of normalizing resources and warn
the user when a given resource is not in Unicode Normalization Form C.
Or we could normalize only for specific locales, because the biggest
problem seems to be that some languages may have trouble with
normalization:
> Content authors SHOULD use Unicode Normalization Form C (NFC) wherever
possible for content. Note that NFC is not always appropriate to the
content or even available to content authors in some languages.
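The "check and warn" alternative could look roughly like the following
Python sketch. The function names `is_nfc` and `nfc_warning` are
hypothetical and exist only for this illustration; in PHP the check would
presumably rely on `Normalizer::isNormalized()` from the optional intl
extension, which is exactly the availability problem mentioned above.

```python
import unicodedata


def is_nfc(text):
    """Return True if text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def nfc_warning(text):
    """Return a warning message for non-NFC text, or None if it is fine."""
    if is_nfc(text):
        return None
    return "Content is not in Unicode Normalization Form C (NFC)."


# "u" followed by U+0308 (combining diaeresis) is not NFC ...
print(nfc_warning("u\u0308ber"))
# ... while the precomposed U+00FC is.
print(nfc_warning("\u00fcber"))
```

A warning like this leaves the decision to the author, whereas the patch
normalizes silently; which one is preferable is the open question here.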
I don't think there are many cases where you really need NFD text. The
advantages of working search, working proofreading, etc. outweigh any
edge cases where NFD text is needed.
I still recommend getting this patch in and then seeing what breaks (if
anything does).
--
Ticket URL: <https://core.trac.wordpress.org/ticket/30130#comment:46>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform