[wp-trac] [WordPress Trac] #57301: Emoji feature detection is incorrect
WordPress Trac
noreply at wordpress.org
Tue Dec 13 20:05:28 UTC 2022
#57301: Emoji feature detection is incorrect
---------------------------+--------------------------------------
Reporter: sergiomdgomes | Owner: (none)
Type: defect (bug) | Status: new
Priority: normal | Milestone: Awaiting Review
Component: Emoji | Version: trunk
Severity: normal | Resolution:
Keywords: | Focuses: javascript, performance
---------------------------+--------------------------------------
Comment (by dmsnell):
> you picked the values from a correct, older test, instead of one of the
new tests that were introduced since #47852 and which effectively broke
the feature-detection.
Ah right, and that makes sense. I didn't see that we have directly encoded
the integers above 0xFFFF in the newer patch.
> So what this patch does is make code points 0x010000 – 0x10FFFF actually
work
For curiosity sake, these code points would have worked before, but we
would have had to split them up, as we did for U+1F170, into the code
units, as is still the case for the tests for the UN flag and the English
flag (they are also represented by these higher code points). That is,
instead of `String.fromCharCode.apply(null, [0x1F1F3])` it would have had
to have been `String.fromCharCode.apply(null, [0xD83C, 0xDDF3])` - these
represent the same strings.
> That seems like the logical best solution
Yeah I agree unless someone has a reason why this wouldn't work. In some
cases, like the test where I saw the use of a zero-width-space I can
understand why seeing the code points in their numeric form to be helpful.
But then again, we can still have that in a string with `'🇺\u200b🇳'`
Or maybe this is the whole point, because now it looks funny. 🤷♂️
> However, we'd still need to be defensive regarding feature support.
Namely, we'd need to determine what the failure mode is for browsers that
don't support unicode escape sequences, and we'd probably need to feature-
detect that lack of support and default to polyfilling emoji, similar to
what my current patch does when it detects a lack of support for
String.fromCodePoint.
I hope I haven't delayed this patch, but I do think it's clear we're
discussing two different issues. "Unicode support" should be entirely
independent from JavaScript or JavaScript version. Essentially what we are
wanting to ask is "are these Unicode Code Points rendered as expected in
the browser," and this patch addresses a separate issue: a bug in the
detection script.
That bug is that someone changed representation of the input test sets
//from// a list of UTF-16 Code Units //to// a list of UTF-16 Code Points
and for those code points that require surrogate pairs in UTF-16, this
broke because they didn't update `fromCharCode` to `fromCodePoint`.
Food for thought: we can fix this bug without excluding older browser
versions that don't support `fromCodePoint`. It seems like a reasonable
choice to exclude them, but it's not necessary to fix the problem, and if
we fix the problem we can retain full support without excluding anyone:
change the supplementary plane code points into code unit sequences.
{{{
isIdentical = emojiSetsRenderIdentically(
- [0x1FAF1, 0x1F3FB, 0x200D,
0x1FAF2, 0x1F3FF],
- [0x1FAF1, 0x1F3FB, 0x200B,
0x1FAF2, 0x1F3FF]
+ [0xD83C, 0xDEF1, 0xD83C, 0xDFFB,
0x200D, 0xD83E, 0xDEF2, 0xD83C, 0xDFFF],
+ [0xD83C, 0xDEF1, 0xD83C, 0xDFFB,
0x200B, 0xD83E, 0xDEF2, 0xD83C, 0xDFFF]
);
}}}
We can do this in a modern browser with the following one-liner
{{{
((input) => '[' + String.fromCodePoint.apply(null, input).split('').map(c
=> '0x' + c.charCodeAt(0).toString(16).padStart(4,
'0').toUpperCase()).join(', ') + ']')([0x1FAF1, 0x1F3FB, 0x200D, 0x1FAF2,
0x1F3FF])
}}}
Three things we could do to prevent further future breakage:
- leave a comment explaining that these need to be UTF-16 code units
according to JavaScript string representation
- `throw` if `emojiSetsRenderIdentically` receives a number above 0xFFFF
in hopes of alerting a developer earlier in the process
- perform the conversion automatically in a way older browsers can handle
{{{
function emojiSetsRenderIdentically( set1, set2 ) {
var stringFromSet = function( set ) {
var i, codeUnits = [];
for (i = 0; i < set.length; i++) {
if (set[i] <= 0xFFFF) {
codeUnits.push(set[i])
} else {
// Split large code points into their
UTF-16 surrogate pairs
// because we need this to create a
JavaScript string.
codeUnits.push(
0xD800 - (0x10000 >> 10) + (set[i]
>> 10),
0xDC00 + (set[i] & 0x3FF)
);
}
}
return String.fromCharCode.apply(null, codeUnits);
}
// Cleanup from previous test.
context.clearRect( 0, 0, canvas.width, canvas.height );
context.fillText( stringFromSet( set1 ), 0, 0 );
var rendered1 = canvas.toDataURL();
// Cleanup from previous test.
context.clearRect( 0, 0, canvas.width, canvas.height );
context.fillText( stringFromSet( this, set2 ), 0, 0 );
var rendered2 = canvas.toDataURL();
return rendered1 === rendered2;
}
}}}
This little extra code safeguards everything so it'll all be what we
expect. Of course, looking at that file `wp-emoji-loader.js`, I can see
that some sets are still listing surrogate pairs while others are listing
single code points; maybe we should make everything consistent while we're
here and try to avoid future confusion on the same //point//?
--
Ticket URL: <https://core.trac.wordpress.org/ticket/57301#comment:10>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform
More information about the wp-trac
mailing list