[wp-trac] [WordPress Trac] #47014: Tag balancing corrupts Custom Elements

Tue Apr 23 16:49:48 UTC 2019

#47014: Tag balancing corrupts Custom Elements
--------------------------------------+------------------------
 Reporter:  westonruter               |       Owner:  flixos90
     Type:  defect (bug)              |      Status:  reviewing
 Priority:  normal                    |   Milestone:  5.3
Component:  Formatting                |     Version:  2.0.4
 Severity:  normal                    |  Resolution:
 Keywords:  has-patch has-unit-tests  |     Focuses:
--------------------------------------+------------------------

Comment (by dmsnell):

 Here are some thoughts from a parsing standpoint; these are my
 interpretations of the standards and may be wrong.

  - Custom elements are not HTML5 tags as
 [https://www.w3.org/TR/html53/syntax.html#tag-name the spec] only allows
 alphanumeric characters in a tag name.
  - Custom elements //are not a standard// yet but rather a draft at this
 point still
  - The Custom Element draft spec is
 [https://w3c.github.io/webcomponents/spec/custom/#valid-custom-element-
 name significantly more permissive] for names than HTML tag names

 This patch supports //some// cases of custom element names but leaves most
 still unsupported. For example, using the example from the draft spec
 `<emotion-😍>` fools the pattern into thinking that the custom element
 name is `emotion-` and its attributes are `😍`. I believe that this will
 result in creating a closing tag `</emotion->` which is also incorrect.

 It appears that this code is trying to find tag-like entities in the
 document and loop over them, performing semantic checks at each opening or
 closing. It might be a good time to not only patch over a single instance
 where this fails, but to consider improving the overall strength of the
 matching pattern to eliminate a number of other errors involved. For the
 purposes of `force_balance_tags()` it seems safe to treat custom elements
 the same way as HTML tags.

 Note that you exposed another glaring bug in the original pattern, only
 appearing now because we're relaxing the restrictions on the tag name: the
 `\s*` matches even when there are no spaces separating the element name
 from the attribute list.

 While ugly, this pattern will match custom element names as currently
 proposed:
 {{{#!php
 # See https://w3c.github.io/webcomponents/spec/custom/#valid-custom-
 element-name
 $tag_like_element =
 '#<(\/?[-.0-9_a-z\xb7\xc0-\xd6\xd8-\xf6\xf8-\x{37d}\x{37f}-\x{1fff}\x{200c}-\x{200d}\x{203f}-\x{2040}\x{2070}-\x{218f}\x{2c00}-\x{2fef}\x{3001}-\x{d7ff}\x{f900}-\x{fdcf}\x{fdf0}-\x{fffd}\x{10000}-\x{effff}]*)\s*([^>]*)>#iu';
 }}}

 There's a small problem here and that is that it now breaks HTML5 tags, as
 custom elements are forbidden from having uppercase letters in their name
 while HTML tags aren't.

 > They do not contain any uppercase ASCII letters, ensuring that the user
 agent can always treat HTML elements ASCII-case-insensitively.
 [https://w3c.github.io/webcomponents/spec/custom/#valid-custom-element-
 name Notes in draft spec]

 As a practical step it seems valuable to expand the allowable character
 set beyond the basic `-` since we expect in the wild to find custom
 elements with names that take advantage of their allowances. Any change
 here, even the proposed change, does incur some breakage on existing HTML
 tag names but the risk is that we'll allow names that are supposed to be
 invalid, possibly treating something that //isn't// really a tag or
 element as if it were.

 [https://regex101.com/r/FQ9s9G/1 some quick examples on regex101.com]

 ----

 For real fun we might consider creating a more robust parser that can
 differentiate and prevent the bugs, but I doubt anyone is going to want to
 do that here and now. 😉

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/47014#comment:4>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform