[wp-trac] [WordPress Trac] #47014: Tag balancing corrupts Custom Elements
WordPress Trac
noreply at wordpress.org
Tue Apr 23 16:49:48 UTC 2019
#47014: Tag balancing corrupts Custom Elements
--------------------------------------+------------------------
Reporter: westonruter | Owner: flixos90
Type: defect (bug) | Status: reviewing
Priority: normal | Milestone: 5.3
Component: Formatting | Version: 2.0.4
Severity: normal | Resolution:
Keywords: has-patch has-unit-tests | Focuses:
--------------------------------------+------------------------
Comment (by dmsnell):
Here are some thoughts from a parsing standpoint; these are my
interpretations of the standards and may be wrong.
- Custom elements are not HTML5 tags as
[https://www.w3.org/TR/html53/syntax.html#tag-name the spec] only allows
alphanumeric characters in a tag name.
- Custom elements //are not a standard// yet but rather a draft at this
point still
- The Custom Element draft spec is
[https://w3c.github.io/webcomponents/spec/custom/#valid-custom-element-
name significantly more permissive] for names than HTML tag names
This patch supports //some// cases of custom element names but leaves most
still unsupported. For example, using the example from the draft spec
`<emotion-😍>` fools the pattern into thinking that the custom element
name is `emotion-` and its attributes are `😍`. I believe that this will
result in creating a closing tag `</emotion->` which is also incorrect.
It appears that this code is trying to find tag-like entities in the
document and loop over them, performing semantic checks at each opening or
closing. It might be a good time to not only patch over a single instance
where this fails, but to consider improving the overall strength of the
matching pattern to eliminate a number of other errors involved. For the
purposes of `force_balance_tags()` it seems safe to treat custom elements
the same way as HTML tags.
Note that you exposed another glaring bug in the original pattern, only
appearing now because we're relaxing the restrictions on the tag name: the
`\s*` matches even when there are no spaces separating the element name
from the attribute list.
While ugly, this pattern will match custom element names as currently
proposed:
{{{#!php
# See https://w3c.github.io/webcomponents/spec/custom/#valid-custom-
element-name
$tag_like_element =
'#<(\/?[-.0-9_a-z\xb7\xc0-\xd6\xd8-\xf6\xf8-\x{37d}\x{37f}-\x{1fff}\x{200c}-\x{200d}\x{203f}-\x{2040}\x{2070}-\x{218f}\x{2c00}-\x{2fef}\x{3001}-\x{d7ff}\x{f900}-\x{fdcf}\x{fdf0}-\x{fffd}\x{10000}-\x{effff}]*)\s*([^>]*)>#iu';
}}}
There's a small problem here and that is that it now breaks HTML5 tags, as
custom elements are forbidden from having uppercase letters in their name
while HTML tags aren't.
> They do not contain any uppercase ASCII letters, ensuring that the user
agent can always treat HTML elements ASCII-case-insensitively.
[https://w3c.github.io/webcomponents/spec/custom/#valid-custom-element-
name Notes in draft spec]
As a practical step it seems valuable to expand the allowable character
set beyond the basic `-` since we expect in the wild to find custom
elements with names that take advantage of their allowances. Any change
here, even the proposed change, does incur some breakage on existing HTML
tag names but the risk is that we'll allow names that are supposed to be
invalid, possibly treating something that //isn't// really a tag or
element as if it were.
[https://regex101.com/r/FQ9s9G/1 some quick examples on regex101.com]
----
For real fun we might consider creating a more robust parser that can
differentiate and prevent the bugs, but I doubt anyone is going to want to
do that here and now. 😉
--
Ticket URL: <https://core.trac.wordpress.org/ticket/47014#comment:4>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform
More information about the wp-trac
mailing list