[wp-trac] [WordPress Trac] #60170: HTML API: Scan every token in an HTML document.

Wed Jan 24 23:36:01 UTC 2024

#60170: HTML API: Scan every token in an HTML document.
-------------------------+----------------------
 Reporter:  dmsnell      |       Owner:  dmsnell
     Type:  enhancement  |      Status:  closed
 Priority:  normal       |   Milestone:  6.5
Component:  HTML API     |     Version:  trunk
 Severity:  normal       |  Resolution:  fixed
 Keywords:  has-patch    |     Focuses:
-------------------------+----------------------
Changes (by dmsnell):

 * owner:  (none) => dmsnell
 * status:  new => closed
 * resolution:   => fixed

Comment:

 In [changeset:"57348" 57348]:
 {{{
 #!CommitTicketReference repository="" revision="57348"
 HTML API: Scan all syntax tokens in a document, read modifiable text.

 Since its introduction in WordPress 6.2 the HTML Tag Processor has
 provided a way to scan through all of the HTML tags in a document and
 then read and modify their attributes. In order to reliably do this, it
 also needed to be aware of other kinds of HTML syntax, but it didn't
 expose those syntax tokens to consumers of the API.

 In this patch the Tag Processor introduces a new scanning method and a
 few helper methods to read information about or from each token. Most
 significantly, this introduces the ability to read `#text` nodes in the
 document.

 What's new in the Tag Processor?
 ================================

  - `next_token()` visits every distinct syntax token in a document.
  - `get_token_type()` indicates what kind of token it is.
  - `get_token_name()` returns something akin to `DOMNode.nodeName`.
  - `get_modifiable_text()` returns the text associated with a token.
  - `get_comment_type()` indicates why a token represents an HTML comment.

 Example usage.
 ==============

 {{{
 <?php
 function strip_all_tags( $html ) {
         $text_content = '';
         $processor    = new WP_HTML_Tag_Processor( $html );

         while ( $processor->next_token() ) {
                 if ( '#text' !== $processor->get_token_type() ) {
                         continue;
                 }

                 $text_content .= $processor->get_modifiable_text();
         }

         return $text_content;
 }
 }}}

 What changes in the Tag Processor?
 ==================================

 Previously, the Tag Processor would scan the opening and closing tag of
 every HTML element separately. Now, however, there are special tags
 which it only visits once, as if those elements were void tags without
 a closer.

 These are special tags because their content contains no other HTML or
 markup, only non-HTML content.

  - SCRIPT elements contain raw text which is isolated from the rest of
    the HTML document and fed separately into a JavaScript engine. There
    are complicated rules to avoid escaping the script context in the HTML.
    The contents are left verbatim, and character references are not
    decoded.

  - TEXTAREA and TITLE elements contain plain text which is decoded
    before display, e.g. transforming `&amp;` into `&`. Any markup which
    resembles tags is treated as verbatim text and not a tag.

  - IFRAME, NOEMBED, NOFRAMES, STYLE, and XMP elements are similar to the
    textarea and title elements, but no character references are decoded.
    For example, `&amp;` inside a STYLE element is passed to the CSS
    engine as the literal string `&amp;` and _not_ as `&`.

 Because it's important not to treat this inner content separately from
 the elements containing it, the Tag Processor combines them when scanning
 into a single match and makes their content available as modifiable
 text (see below).

 This means that the Tag Processor will no longer visit a closing tag for
 any of these elements unless that tag is unexpected.

 {{{
     <title>There is only a single token in this line</title>
     <title>There are two tokens in this line></title></title>
     </title><title>There are still two tokens in this line></title>
 }}}
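
 The three lines above can be checked with a short sketch that counts how
 many tokens the Tag Processor visits. It assumes WordPress 6.5 or later
 is loaded; `count_tokens()` is a hypothetical helper name and not part of
 the API.

 {{{
 <?php
 // Hypothetical helper; assumes WP_HTML_Tag_Processor (WordPress 6.5+) is loaded.
 function count_tokens( $html ) {
         $count     = 0;
         $processor = new WP_HTML_Tag_Processor( $html );

         // Each successful call to next_token() lands on exactly one token.
         while ( $processor->next_token() ) {
                 ++$count;
         }

         return $count;
 }
 }}}

 Passing each of the three example lines to `count_tokens()` should give
 1, 2, and 2 respectively, while `<div>text</div>` should give 3 because
 the DIV opener, the text node, and the DIV closer are visited separately.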

 What are tokens?
 ================

 The term "token" here is a parsing term, which means a primitive unit in
 HTML. There are only a few kinds of tokens in HTML:

  - a tag has a name, attributes, and a closing or self-closing flag.
  - a text node, or `#text` node contains plain text which is displayed
    in a browser and which is decoded before display.
  - a DOCTYPE declaration indicates how to parse the document.
  - a comment is hidden from the display on a page but present in the HTML.

 There are a few more kinds of tokens that the HTML Tag Processor will
 recognize, some of which don't exist as concepts in HTML. These mostly
 comprise XML syntax elements that aren't part of HTML (such as CDATA and
 processing instructions) and invalid HTML syntax that transforms into
 comments.
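
 As a rough illustration of these token kinds, the following sketch
 records the type and name of every token in a document. It assumes
 WordPress 6.5 or later is loaded; `list_tokens()` is a hypothetical
 helper name and not part of the API.

 {{{
 <?php
 // Hypothetical helper; assumes WP_HTML_Tag_Processor (WordPress 6.5+) is loaded.
 function list_tokens( $html ) {
         $tokens    = array();
         $processor = new WP_HTML_Tag_Processor( $html );

         while ( $processor->next_token() ) {
                 $tokens[] = array(
                         // Kind of token, e.g. '#tag', '#text', or '#comment'.
                         'type' => $processor->get_token_type(),
                         // For tags this is the tag name; otherwise akin to nodeName.
                         'name' => $processor->get_token_name(),
                 );
         }

         return $tokens;
 }
 }}}

 For the input `<p>Hi</p><!-- note -->` this should report four tokens
 whose types are `#tag`, `#text`, `#tag`, and `#comment`, with the two tag
 tokens named `P`.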

 What is a funky comment?
 ========================

 This patch treats a specific kind of invalid comment in a special way.
 A closing tag with an invalid name is considered a "funky comment." In
 the browser these become HTML comments just like any other, but their
 syntax is convenient for representing bits of information in a
 well-defined way. Given the parsing rules for this invalid syntax, they
 cannot be nested or recursive.

  - `</1>`
  - `</%avatar_url>`
  - `</{"wp_bit": {"type": "post-author"}}>`
  - `</[post-author]>`
  - `</__( 'Save Post' );>`

 All of these examples become HTML comments in the browser. The content
 inside a funky comment is easy to parse: the only rule is that the
 comment starts at the `<` and continues until the nearest `>`. There
 can be no funky comment inside another, because that would imply having
 a `>` inside of one, which would actually terminate the first one.
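
 A minimal sketch of collecting the content of funky comments follows. It
 assumes WordPress 6.5 or later is loaded, and it assumes that the token
 type reported for these tokens is `#funky-comment`, which is not stated
 in this message and should be verified against the code.
 `get_funky_comments()` is a hypothetical helper name.

 {{{
 <?php
 // Hypothetical helper; assumes WP_HTML_Tag_Processor (WordPress 6.5+) is loaded.
 function get_funky_comments( $html ) {
         $found     = array();
         $processor = new WP_HTML_Tag_Processor( $html );

         while ( $processor->next_token() ) {
                 // Assumption: funky comments report '#funky-comment' as their type.
                 if ( '#funky-comment' !== $processor->get_token_type() ) {
                         continue;
                 }

                 // The modifiable text is what sits between `</` and `>`.
                 $found[] = $processor->get_modifiable_text();
         }

         return $found;
 }
 }}}

 Under those assumptions, `<p></%avatar_url></p>` should yield a single
 entry, `%avatar_url`.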

 What is modifiable text?
 ========================

 Modifiable text is similar to the `innerText` property of a DOM node.
 It represents the span of text for a given token which may be modified
 without changing the structure of the HTML document or the token.

 There is currently no mechanism to change the modifiable text, but this
 is planned to arrive in a later patch.

 Tags
 ====

 Most tags have no modifiable text because they have child nodes where
 text nodes are found. Only the special tags mentioned above have
 modifiable text.

 {{{
     <div class="post">Another day in HTML</div>
     └─ tag ──────────┘└─ text node ─────┘└────┴─ tag
 }}}

 {{{
     <title>Is <img> > <image>?</title>
     │      └ modifiable text ───┘       │ "Is <img> > <image>?"
     └─ tag ─────────────────────────────┘
 }}}

 Text nodes
 ==========

 Text nodes are entirely modifiable text.

 {{{
     This HTML document has no tags.
     └─ modifiable text ───────────┘
 }}}

 Comments
 ========

 The modifiable text inside a comment is the portion of the comment that
 doesn't form its syntax. This also applies to several kinds of invalid
 comments.

 {{{
     <!-- this is inside a comment -->
     │   └─ modifiable text ──────┘  │
     └─ comment token ───────────────┘
 }}}

 {{{
     <!-->
     This invalid comment has no modifiable text.
 }}}

 {{{
     <? this is an invalid comment -->
     │ └─ modifiable text ────────┘  │
     └─ comment token ───────────────┘
 }}}

 {{{
     <![CDATA[this is an invalid comment]]>
     │        └─ modifiable text ──────┘  │
     └─ comment token ────────────────────┘
 }}}
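
 The comment cases above can be explored with a sketch like the following,
 which pairs each comment's type with its modifiable text. It assumes
 WordPress 6.5 or later is loaded; `describe_comments()` is a hypothetical
 helper name, and `WP_HTML_Tag_Processor::COMMENT_AS_HTML_COMMENT` is the
 class constant understood to mark a well-formed comment.

 {{{
 <?php
 // Hypothetical helper; assumes WP_HTML_Tag_Processor (WordPress 6.5+) is loaded.
 function describe_comments( $html ) {
         $comments  = array();
         $processor = new WP_HTML_Tag_Processor( $html );

         while ( $processor->next_token() ) {
                 if ( '#comment' !== $processor->get_token_type() ) {
                         continue;
                 }

                 $comments[] = array(
                         // Why this token is a comment, e.g. a normal or an invalid form.
                         'type' => $processor->get_comment_type(),
                         // The comment's content without its surrounding syntax.
                         'text' => $processor->get_modifiable_text(),
                 );
         }

         return $comments;
 }
 }}}

 For `<!-- this is inside a comment -->` the single entry should have the
 type `WP_HTML_Tag_Processor::COMMENT_AS_HTML_COMMENT` and the text
 ` this is inside a comment `, including the surrounding spaces.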

 Other token types also have modifiable text. Consult the code or tests
 for further information.

 Developed in https://github.com/WordPress/wordpress-develop/pull/5683
 Discussed in https://core.trac.wordpress.org/ticket/60170

 Follows [57575]

 Props bernhard-reiter, dlh, dmsnell, jonsurrell, zieladam
 Fixes #60170
 }}}

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/60170#comment:3>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform