[wp-trac] [WordPress Trac] #57301: Emoji feature detection is incorrect

Tue Dec 13 20:05:28 UTC 2022

#57301: Emoji feature detection is incorrect
---------------------------+--------------------------------------
 Reporter:  sergiomdgomes  |       Owner:  (none)
     Type:  defect (bug)   |      Status:  new
 Priority:  normal         |   Milestone:  Awaiting Review
Component:  Emoji          |     Version:  trunk
 Severity:  normal         |  Resolution:
 Keywords:                 |     Focuses:  javascript, performance
---------------------------+--------------------------------------

Comment (by dmsnell):

 > you picked the values from a correct, older test, instead of one of the
 new tests that were introduced since #47852 and which effectively broke
 the feature-detection.

 Ah right, and that makes sense. I didn't see that we have directly encoded
 the integers above 0xFFFF in the newer patch.

 > So what this patch does is make code points 0x010000 – 0x10FFFF actually
 work

 For curiosity's sake, these code points would have worked before, but we
 would have had to split them up into their UTF-16 code units, as we did
 for U+1F170 and as is still the case for the tests for the UN flag and the
 English flag (they are also represented by these higher code points). That
 is, instead of `String.fromCharCode.apply(null, [0x1F1F3])` it would have
 had to have been `String.fromCharCode.apply(null, [0xD83C, 0xDDF3])`: the
 surrogate pair `0xD83C, 0xDDF3` is the UTF-16 encoding of U+1F1F3.
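
 A small sketch of why the distinction matters, runnable in a browser
 console or Node. The variable names are only for illustration; the point
 is that `fromCharCode` keeps just the low 16 bits of each argument, so a
 bare supplementary-plane code point is silently truncated rather than
 rejected.

```javascript
// String.fromCharCode converts each argument to an unsigned 16-bit value,
// so 0x1F1F3 is truncated to 0xF1F3 instead of producing the flag letter.
var truncated = String.fromCharCode( 0x1F1F3 );
var fromPair  = String.fromCharCode( 0xD83C, 0xDDF3 );

console.log( truncated.length );                         // 1
console.log( truncated.charCodeAt( 0 ).toString( 16 ) ); // "f1f3"
console.log( fromPair.length );                          // 2

// String.fromCodePoint is ES2015, so guard it for older engines.
if ( typeof String.fromCodePoint === 'function' ) {
	console.log( String.fromCodePoint( 0x1F1F3 ) === fromPair ); // true
}
```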

 > That seems like the logical best solution

 Yeah I agree unless someone has a reason why this wouldn't work. In some
 cases, like the test where I saw the use of a zero-width space, I can
 understand why seeing the code points in their numeric form would be
 helpful. But then again, we can still have that in a string with
 `'🇺\u200b🇳'`

 Or maybe this is the whole point, because now it looks funny. 🤷‍♂️

 > However, we'd still need to be defensive regarding feature support.
 Namely, we'd need to determine what the failure mode is for browsers that
 don't support unicode escape sequences, and we'd probably need to feature-
 detect that lack of support and default to polyfilling emoji, similar to
 what my current patch does when it detects a lack of support for
 String.fromCodePoint.

 I hope I haven't delayed this patch, but I do think it's clear we're
 discussing two different issues. "Unicode support" should be entirely
 independent from JavaScript or JavaScript version. Essentially what we are
 wanting to ask is "are these Unicode Code Points rendered as expected in
 the browser," and this patch addresses a separate issue: a bug in the
 detection script.

 That bug is that someone changed the representation of the input test sets
 //from// a list of UTF-16 Code Units //to// a list of Unicode Code Points
 without also changing `fromCharCode` to `fromCodePoint`, which broke the
 test for every code point that requires a surrogate pair in UTF-16.

 Food for thought: we can fix this bug without excluding older browser
 versions that don't support `fromCodePoint`. It seems like a reasonable
 choice to exclude them, but it's not necessary to fix the problem, and if
 we fix the problem we can retain full support without excluding anyone:
 change the supplementary plane code points into code unit sequences.

 {{{
                                 isIdentical = emojiSetsRenderIdentically(
 -                                       [0x1FAF1, 0x1F3FB, 0x200D, 0x1FAF2, 0x1F3FF],
 -                                       [0x1FAF1, 0x1F3FB, 0x200B, 0x1FAF2, 0x1F3FF]
 +                                       [0xD83E, 0xDEF1, 0xD83C, 0xDFFB, 0x200D, 0xD83E, 0xDEF2, 0xD83C, 0xDFFF],
 +                                       [0xD83E, 0xDEF1, 0xD83C, 0xDFFB, 0x200B, 0xD83E, 0xDEF2, 0xD83C, 0xDFFF]
                                 );
 }}}

 We can do this in a modern browser with the following one-liner
 {{{
 ((input) => '[' + String.fromCodePoint.apply(null, input).split('')
     .map(c => '0x' + c.charCodeAt(0).toString(16).padStart(4, '0').toUpperCase())
     .join(', ') + ']')([0x1FAF1, 0x1F3FB, 0x200D, 0x1FAF2, 0x1F3FF])
 }}}
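
 For anyone who finds the one-liner hard to read, here is a longer sketch
 of the same conversion. `toCodeUnitList` is a hypothetical helper name,
 not something in core, and it still relies on `String.fromCodePoint`, so
 it is meant for a developer's console rather than for shipping.

```javascript
// Hypothetical helper: expands Unicode code points into UTF-16 code units
// and formats them as a bracketed list of four-digit hex literals.
function toCodeUnitList( codePoints ) {
	var str = String.fromCodePoint.apply( null, codePoints );
	var units = [];
	var i, hex;

	for ( i = 0; i < str.length; i++ ) {
		hex = str.charCodeAt( i ).toString( 16 ).toUpperCase();
		// Left-pad to four digits without relying on padStart.
		while ( hex.length < 4 ) {
			hex = '0' + hex;
		}
		units.push( '0x' + hex );
	}

	return '[' + units.join( ', ' ) + ']';
}

console.log( toCodeUnitList( [ 0x1FAF1, 0x1F3FB, 0x200D, 0x1FAF2, 0x1F3FF ] ) );
// [0xD83E, 0xDEF1, 0xD83C, 0xDFFB, 0x200D, 0xD83E, 0xDEF2, 0xD83C, 0xDFFF]
```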

 Three things we could do to prevent further future breakage:
  - leave a comment explaining that these need to be UTF-16 code units
 according to JavaScript string representation
  - `throw` if `emojiSetsRenderIdentically` receives a number above 0xFFFF
 in hopes of alerting a developer earlier in the process
  - perform the conversion automatically in a way older browsers can handle

 {{{
 function emojiSetsRenderIdentically( set1, set2 ) {
         var stringFromSet = function( set ) {
                 var i, codeUnits = [];

                 for (i = 0; i < set.length; i++) {
                         if (set[i] <= 0xFFFF) {
                                 codeUnits.push(set[i]);
                         } else {
                                 // Split code points above 0xFFFF into their UTF-16
                                 // surrogate pair, because JavaScript strings are
                                 // sequences of UTF-16 code units.
                                 codeUnits.push(
                                         0xD800 - (0x10000 >> 10) + (set[i] >> 10),
                                         0xDC00 + (set[i] & 0x3FF)
                                 );
                         }
                 }

                 return String.fromCharCode.apply(null, codeUnits);
         };

         // Cleanup from previous test.
         context.clearRect( 0, 0, canvas.width, canvas.height );
         context.fillText( stringFromSet( set1 ), 0, 0 );
         var rendered1 = canvas.toDataURL();

         // Cleanup from previous test.
         context.clearRect( 0, 0, canvas.width, canvas.height );
         context.fillText( stringFromSet( set2 ), 0, 0 );
         var rendered2 = canvas.toDataURL();

         return rendered1 === rendered2;
 }
 }}}

 This little extra code safeguards everything so it'll all be what we
 expect. Of course, looking at that file `wp-emoji-loader.js`, I can see
 that some sets are still listing surrogate pairs while others are listing
 single code points; maybe we should make everything consistent while we're
 here and try to avoid future confusion on the same //point//?
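
 For comparison, the `throw` option from the list above could be as small
 as the following. This is only a sketch: `assertCodeUnits` is a
 hypothetical name, and whether throwing inside the loader script is
 acceptable at all is a separate question.

```javascript
// Hypothetical guard, not part of wp-emoji-loader.js: rejects any entry
// that cannot be a single UTF-16 code unit, so that a code point pasted
// in by mistake fails loudly instead of being truncated.
function assertCodeUnits( set ) {
	var i;

	for ( i = 0; i < set.length; i++ ) {
		if ( set[ i ] > 0xFFFF ) {
			throw new RangeError(
				'Expected UTF-16 code units but got code point 0x' +
				set[ i ].toString( 16 ).toUpperCase() +
				' at index ' + i + '; split it into a surrogate pair.'
			);
		}
	}

	return set;
}

// Passes through a valid surrogate pair unchanged.
assertCodeUnits( [ 0xD83C, 0xDDF3 ] );
```

 Calling `assertCodeUnits( [ 0x1F1F3 ] )` would throw a `RangeError` that
 names the offending value and its index.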

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/57301#comment:10>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform