[wp-trac] [WordPress Trac] #30130: Normalize characters with combining marks to precomposed characters

WordPress Trac noreply at wordpress.org
Mon Oct 27 22:36:32 UTC 2014


#30130: Normalize characters with combining marks to precomposed characters
-------------------------+-----------------------------
 Reporter:  zodiac1978   |      Owner:
     Type:  enhancement  |     Status:  new
 Priority:  normal       |  Milestone:  Awaiting Review
Component:  General      |    Version:  trunk
 Severity:  normal       |   Keywords:
  Focuses:               |
-------------------------+-----------------------------
 I ran into a slightly odd problem that I would like to solve:

 I have a PDF file with German umlauts (üöäÜÖÄ). If I copy and paste them
 into WordPress, I get the base vowel (uoaUOA) followed by a combining
 diaeresis, U+0308
 (http://www.fileformat.info/info/unicode/char/0308/index.htm), instead of
 a single precomposed character.
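
 To make the difference concrete, here is a minimal sketch comparing the
 two encodings of "ü" at the byte level. The variable names are only for
 illustration, and the strings are written as UTF-8 byte escapes so the
 result does not depend on the encoding of the source file:

```php
<?php
// Precomposed form: U+00FC (LATIN SMALL LETTER U WITH DIAERESIS).
$precomposed = "\xC3\xBC";

// Decomposed form: U+0075 (u) followed by U+0308 (COMBINING DIAERESIS).
$decomposed = "u\xCC\x88";

// Both render as the same glyph, but the byte sequences differ,
// so a plain string comparison treats them as different text.
var_dump( $precomposed === $decomposed ); // bool(false)
var_dump( strlen( $precomposed ) );       // int(2)
var_dump( strlen( $decomposed ) );        // int(3)
```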

 This causes several problems:
 - Searching for words with umlauts doesn't work
 - Proofreading fails
 - W3C validation fails with the warning "Text run is not in Unicode
 Normalization Form C." because precomposed characters are preferred (see
 http://www.w3.org/International/docs/charmod-norm/#choice-of-normalization-form)

 Solution: I made a proof of concept using the "content_save_pre" filter,
 and it works. It simply replaces each two-character sequence with the
 corresponding precomposed character:

 '''$content = str_replace( "a\xCC\x88", "ä", $content );
 $content = str_replace( "o\xCC\x88", "ö", $content );
 $content = str_replace( "u\xCC\x88", "ü", $content );
 $content = str_replace( "A\xCC\x88", "Ä", $content );
 $content = str_replace( "O\xCC\x88", "Ö", $content );
 $content = str_replace( "U\xCC\x88", "Ü", $content );'''
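
 For context, this is roughly how the replacements could be wired into the
 "content_save_pre" filter. It is only a sketch, not the upcoming patch:
 the function name poc_precompose_german_umlauts is made up for this
 example, and the add_filter() call is guarded so the snippet also runs
 outside WordPress:

```php
<?php
/**
 * Replace "vowel + combining diaeresis" with the precomposed umlaut.
 * Hypothetical helper for illustration; covers the six German umlauts only.
 */
function poc_precompose_german_umlauts( $content ) {
	// Keys are the decomposed UTF-8 bytes, values the precomposed ones.
	$map = array(
		"a\xCC\x88" => "\xC3\xA4", // ä
		"o\xCC\x88" => "\xC3\xB6", // ö
		"u\xCC\x88" => "\xC3\xBC", // ü
		"A\xCC\x88" => "\xC3\x84", // Ä
		"O\xCC\x88" => "\xC3\x96", // Ö
		"U\xCC\x88" => "\xC3\x9C", // Ü
	);

	return strtr( $content, $map );
}

// Register the filter only when WordPress is loaded.
if ( function_exists( 'add_filter' ) ) {
	add_filter( 'content_save_pre', 'poc_precompose_german_umlauts' );
}
```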

 If we could rely on PHP 5.3+ (I know we can't, because WP still supports
 PHP 5.2), we could use a built-in function for this:
 http://php.net/manual/de/normalizer.normalize.php

 The code above (also used in the upcoming patch) would then shrink to a
 single, much more general line:
 '''$content = normalizer_normalize( $content, Normalizer::FORM_C );'''
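
 Since normalizer_normalize() comes from the intl extension and may not be
 available everywhere, one possible way to combine both approaches is to
 use it when present and fall back to the umlaut replacements otherwise.
 This is a sketch under that assumption; the function name
 poc_normalize_nfc is invented for this example:

```php
<?php
/**
 * Normalize to NFC when the intl extension is available, otherwise fall
 * back to replacing the six German umlauts. Hypothetical helper.
 */
function poc_normalize_nfc( $content ) {
	if ( function_exists( 'normalizer_normalize' ) ) {
		$normalized = normalizer_normalize( $content, Normalizer::FORM_C );

		// normalizer_normalize() returns false on failure
		// (for example on invalid UTF-8), so keep going in that case.
		if ( false !== $normalized ) {
			return $normalized;
		}
	}

	// Fallback: only handles the German umlauts from the proof of concept.
	return strtr(
		$content,
		array(
			"a\xCC\x88" => "\xC3\xA4", // ä
			"o\xCC\x88" => "\xC3\xB6", // ö
			"u\xCC\x88" => "\xC3\xBC", // ü
			"A\xCC\x88" => "\xC3\x84", // Ä
			"O\xCC\x88" => "\xC3\x96", // Ö
			"U\xCC\x88" => "\xC3\x9C", // Ü
		)
	);
}
```

 With intl loaded this also covers other combining sequences; without it,
 the behavior matches the proof of concept.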

 Fun facts:
 For me, the problem occurs only on Mac OS X (Lion, 10.7.5); I couldn't
 reproduce it on Ubuntu 14.04 or Windows 7.

 Maybe this is an edge case and/or plugin territory.

--
Ticket URL: <https://core.trac.wordpress.org/ticket/30130>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

