<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>[57348] trunk: HTML API: Scan all syntax tokens in a document, read modifiable text.</title>
</head>
<body>

<style type="text/css"><!--
#msg dl.meta { border: 1px #006 solid; background: #369; padding: 6px; color: #fff; }
#msg dl.meta dt { float: left; width: 6em; font-weight: bold; }
#msg dt:after { content:':';}
#msg dl, #msg dt, #msg ul, #msg li, #header, #footer, #logmsg { font-family: verdana,arial,helvetica,sans-serif; font-size: 10pt;  }
#msg dl a { font-weight: bold}
#msg dl a:link    { color:#fc3; }
#msg dl a:active  { color:#ff0; }
#msg dl a:visited { color:#cc6; }
h3 { font-family: verdana,arial,helvetica,sans-serif; font-size: 10pt; font-weight: bold; }
#msg pre { white-space: pre-line; overflow: auto; background: #ffc; border: 1px #fa0 solid; padding: 6px; }
#logmsg { background: #ffc; border: 1px #fa0 solid; padding: 1em 1em 0 1em; }
#logmsg p, #logmsg pre, #logmsg blockquote { margin: 0 0 1em 0; }
#logmsg p, #logmsg li, #logmsg dt, #logmsg dd { line-height: 14pt; }
#logmsg h1, #logmsg h2, #logmsg h3, #logmsg h4, #logmsg h5, #logmsg h6 { margin: .5em 0; }
#logmsg h1:first-child, #logmsg h2:first-child, #logmsg h3:first-child, #logmsg h4:first-child, #logmsg h5:first-child, #logmsg h6:first-child { margin-top: 0; }
#logmsg ul, #logmsg ol { padding: 0; list-style-position: inside; margin: 0 0 0 1em; }
#logmsg ul { text-indent: -1em; padding-left: 1em; }#logmsg ol { text-indent: -1.5em; padding-left: 1.5em; }
#logmsg > ul, #logmsg > ol { margin: 0 0 1em 0; }
#logmsg pre { background: #eee; padding: 1em; }
#logmsg blockquote { border: 1px solid #fa0; border-left-width: 10px; padding: 1em 1em 0 1em; background: white;}
#logmsg dl { margin: 0; }
#logmsg dt { font-weight: bold; }
#logmsg dd { margin: 0; padding: 0 0 0.5em 0; }
#logmsg dd:before { content:'\00bb';}
#logmsg table { border-spacing: 0px; border-collapse: collapse; border-top: 4px solid #fa0; border-bottom: 1px solid #fa0; background: #fff; }
#logmsg table th { text-align: left; font-weight: normal; padding: 0.2em 0.5em; border-top: 1px dotted #fa0; }
#logmsg table td { text-align: right; border-top: 1px dotted #fa0; padding: 0.2em 0.5em; }
#logmsg table thead th { text-align: center; border-bottom: 1px solid #fa0; }
#logmsg table th.Corner { text-align: left; }
#logmsg hr { border: none 0; border-top: 2px dashed #fa0; height: 1px; }
#header, #footer { color: #fff; background: #636; border: 1px #300 solid; padding: 6px; }
#patch { width: 100%; }
#patch h4 {font-family: verdana,arial,helvetica,sans-serif;font-size:10pt;padding:8px;background:#369;color:#fff;margin:0;}
#patch .propset h4, #patch .binary h4 {margin:0;}
#patch pre {padding:0;line-height:1.2em;margin:0;}
#patch .diff {width:100%;background:#eee;padding: 0 0 10px 0;overflow:auto;}
#patch .propset .diff, #patch .binary .diff  {padding:10px 0;}
#patch span {display:block;padding:0 10px;}
#patch .modfile, #patch .addfile, #patch .delfile, #patch .propset, #patch .binary, #patch .copfile {border:1px solid #ccc;margin:10px 0;}
#patch ins {background:#dfd;text-decoration:none;display:block;padding:0 10px;}
#patch del {background:#fdd;text-decoration:none;display:block;padding:0 10px;}
#patch .lines, .info {color:#888;background:#fff;}
--></style>
<div id="msg">
<dl class="meta" style="font-size: 105%">
<dt style="float: left; width: 6em; font-weight: bold">Revision</dt> <dd><a style="font-weight: bold" href="https://core.trac.wordpress.org/changeset/57348">57348</a></dd>
<dt style="float: left; width: 6em; font-weight: bold">Author</dt> <dd>dmsnell</dd>
<dt style="float: left; width: 6em; font-weight: bold">Date</dt> <dd>2024-01-24 23:35:46 +0000 (Wed, 24 Jan 2024)</dd>
</dl>

<pre style='padding-left: 1em; margin: 2em 0; border-left: 2px solid #ccc; line-height: 1.25; font-size: 105%; font-family: sans-serif'>HTML API: Scan all syntax tokens in a document, read modifiable text.

Since its introduction in WordPress 6.2 the HTML Tag Processor has
provided a way to scan through all of the HTML tags in a document and
then read and modify their attributes. In order to reliably do this, it
also needed to be aware of other kinds of HTML syntax, but it didn't
expose those syntax tokens to consumers of the API.

In this patch the Tag Processor introduces a new scanning method and a
few helper methods to read information about or from each token. Most
significantly, this introduces the ability to read `#text` nodes in the
document.

What's new in the Tag Processor?
================================

 - `next_token()` visits every distinct syntax token in a document.
 - `get_token_type()` indicates what kind of token it is.
 - `get_token_name()` returns something akin to `DOMNode.nodeName`.
 - `get_modifiable_text()` returns the text associated with a token.
 - `get_comment_type()` indicates why a token represents an HTML comment.

Example usage.
==============

{{{
<?php
function strip_all_tags( $html ) {
        $text_content = '';
        $processor    = new WP_HTML_Tag_Processor( $html );

        while ( $processor->next_token() ) {
                if ( '#text' !== $processor->get_token_type() ) {
                        continue;
                }

                $text_content .= $processor->get_modifiable_text();
        }

        return $text_content;
}
}}}

What changes in the Tag Processor?
==================================

Previously, the Tag Processor would scan the opening and closing tag of
every HTML element separately. Now, however, there are special tags
which it only visits once, as if those elements were void tags without
a closer.

These are special tags because their content contains no other HTML or
markup, only non-HTML content.

 - SCRIPT elements contain raw text which is isolated from the rest of
   the HTML document and fed separately into a JavaScript engine. There
   are complicated rules to avoid escaping the script context in the HTML.
   The contents are left verbatim, and character references are not decoded.

 - TEXTAREA and TITLE elements contain plain text which is decoded
   before display, e.g. transforming `&amp;` into `&`. Any markup which
   resembles tags is treated as verbatim text and not a tag.

 - IFRAME, NOEMBED, NOFRAMES, STYLE, and XMP elements are similar to the
   textarea and title elements, but no character references are decoded.
   For example, `&amp;` inside a STYLE element is passed to the CSS engine
   as the literal string `&amp;` and _not_ as `&`.

Because it's important not to treat this inner content separately from the
elements containing it, the Tag Processor combines them when scanning
into a single match and makes their content available as modifiable
text (see below).

This means that the Tag Processor will no longer visit a closing tag for
any of these elements unless that tag is unexpected.

{{{
    <title>There is only a single token in this line</title>
    <title>There are two tokens in this line></title></title>
    </title><title>There are still two tokens in this line></title>
}}}
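
As a rough sketch of the effect, the loop below lists every token found
in a short document. It assumes WordPress 6.5 or later is loaded, and
`list_token_names()` is a hypothetical helper, not part of this patch.

{{{
<?php
// Hypothetical helper: collect the type and name of each token in order.
function list_token_names( $html ) {
        $names     = array();
        $processor = new WP_HTML_Tag_Processor( $html );

        while ( $processor->next_token() ) {
                $names[] = $processor->get_token_type() . ' ' . $processor->get_token_name();
        }

        return $names;
}

// The TITLE element should be reported once, without a separate closer,
// while the P element should still report its opener and its closer.
$names = list_token_names( '<title>One token</title><p>Text</p>' );
}}}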

What are tokens?
================

The term "token" here is a parsing term, which means a primitive unit in
HTML. There are only a few kinds of tokens in HTML:

 - a tag has a name, attributes, and a closing or self-closing flag.
 - a text node, or `#text` node contains plain text which is displayed
   in a browser and which is decoded before display.
 - a DOCTYPE declaration indicates how to parse the document.
 - a comment is hidden from the display on a page but present in the HTML.

There are a few more kinds of tokens that the HTML Tag Processor will
recognize, some of which don't exist as concepts in HTML. These mostly
comprise XML syntax elements that aren't part of HTML (such as CDATA and
processing instructions) and invalid HTML syntax that transforms into
comments.

What is a funky comment?
========================

This patch treats a specific kind of invalid comment in a special way.
A closing tag with an invalid name is considered a "funky comment." In
the browser these become HTML comments just like any other, but their
syntax is convenient for representing a variety of bits of information
in a well-defined way, and they cannot be nested or recursive given
the parsing rules for this invalid syntax.

 - `</1>`
 - `</%avatar_url>`
 - `</{"wp_bit": {"type": "post-author"}}>`
 - `</[post-author]>`
 - `</__( 'Save Post' );>`

All of these examples become HTML comments in the browser. The content
inside a funky comment is easy to parse: the only rule is that it
starts at the `<` and continues until the nearest `>`. There can be no
funky comment inside another, because that would imply having a `>`
inside of one, which would actually terminate the first one.
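
As a sketch of how a consumer might collect these, the loop below
gathers the modifiable text of every funky comment. It assumes that
`get_token_type()` reports them as `#funky-comment`, a value this
message does not state, and `collect_funky_comments()` is a
hypothetical name used only for illustration.

{{{
<?php
function collect_funky_comments( $html ) {
        $found     = array();
        $processor = new WP_HTML_Tag_Processor( $html );

        while ( $processor->next_token() ) {
                // Assumed token type for closing tags with invalid names.
                if ( '#funky-comment' !== $processor->get_token_type() ) {
                        continue;
                }

                $found[] = $processor->get_modifiable_text();
        }

        return $found;
}

$placeholders = collect_funky_comments( '<p>By </%post_author> on </%post_date></p>' );
}}}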

What is modifiable text?
========================

Modifiable text is similar to the `innerText` property of a DOM node.
It represents the span of text for a given token which may be modified
without changing the structure of the HTML document or the token.

There is currently no mechanism to change the modifiable text, but this
is planned to arrive in a later patch.

Tags
====

Most tags have no modifiable text because they have child nodes where
text nodes are found. Only the special tags mentioned above have
modifiable text.

{{{
    <div class="post">Another day in HTML</div>
    └─ tag ──────────┘└─ text node ─────┘└────┴─ tag
}}}

{{{
    <title>Is <img> &gt; <image>?</title>
    │      └ modifiable text ───┘       │ "Is <img> > <image>?"
    └─ tag ─────────────────────────────┘
}}}

Text nodes
==========

Text nodes are entirely modifiable text.

{{{
    This HTML document has no tags.
    └─ modifiable text ───────────┘
}}}

Comments
========

The modifiable text inside a comment is the portion of the comment that
doesn't form its syntax. This also applies to a number of invalid comments.

{{{
    <!-- this is inside a comment -->
    │   └─ modifiable text ──────┘  │
    └─ comment token ───────────────┘
}}}

{{{
    <!-->
    This invalid comment has no modifiable text.
}}}

{{{
    <? this is an invalid comment -->
    │ └─ modifiable text ────────┘  │
    └─ comment token ───────────────┘
}}}

{{{
    <![CDATA[this is an invalid comment]]>
    │        └─ modifiable text ──────┘  │
    └─ comment token ────────────────────┘
}}}

Other token types also have modifiable text. Consult the code or tests
for further information.
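
As a minimal sketch, the loop below pairs each comment's modifiable
text with the reason it became a comment. It assumes WordPress 6.5 or
later is loaded; `describe_comments()` is a hypothetical helper, and
the values returned by `get_comment_type()` are left uninterpreted
because this message does not list them.

{{{
<?php
function describe_comments( $html ) {
        $comments  = array();
        $processor = new WP_HTML_Tag_Processor( $html );

        while ( $processor->next_token() ) {
                if ( '#comment' !== $processor->get_token_type() ) {
                        continue;
                }

                $comments[] = array(
                        'type' => $processor->get_comment_type(),
                        'text' => $processor->get_modifiable_text(),
                );
        }

        return $comments;
}

// One well-formed comment followed by the invalid comment `<!-->`.
$comments = describe_comments( '<!-- a note --><!-->' );
}}}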

Developed in https://github.com/WordPress/wordpress-develop/pull/5683
Discussed in https://core.trac.wordpress.org/ticket/60170

Follows <a href="https://core.trac.wordpress.org/changeset/57575">[57575]</a>

Props bernhard-reiter, dlh, dmsnell, jonsurrell, zieladam
Fixes <a href="https://core.trac.wordpress.org/ticket/60170">#60170</a></pre>

<h3>Modified Paths</h3>
<ul>
<li><a href="#trunksrcwpincludeshtmlapiclasswphtmlprocessorphp">trunk/src/wp-includes/html-api/class-wp-html-processor.php</a></li>
<li><a href="#trunksrcwpincludeshtmlapiclasswphtmltagprocessorphp">trunk/src/wp-includes/html-api/class-wp-html-tag-processor.php</a></li>
<li><a href="#trunktestsphpunittestshtmlapiwpHtmlProcessorBreadcrumbsphp">trunk/tests/phpunit/tests/html-api/wpHtmlProcessorBreadcrumbs.php</a></li>
<li><a href="#trunktestsphpunittestshtmlapiwpHtmlTagProcessorphp">trunk/tests/phpunit/tests/html-api/wpHtmlTagProcessor.php</a></li>
</ul>

<h3>Added Paths</h3>
<ul>
<li><a href="#trunktestsphpunittestshtmlapiwpHtmlTagProcessortokenscanningphp">trunk/tests/phpunit/tests/html-api/wpHtmlTagProcessor-token-scanning.php</a></li>
</ul>

</div>
<div id="patch">
<h3>Diff</h3>
<a id="trunksrcwpincludeshtmlapiclasswphtmlprocessorphp"></a>
<div class="modfile"><h4 style="background-color: #eee; color: inherit; margin: 1em 0; padding: 1.3em; font-size: 115%">Modified: trunk/src/wp-includes/html-api/class-wp-html-processor.php</h4>
<pre class="diff"><span>
<span class="info" style="display: block; padding: 0 10px; color: #888">--- trunk/src/wp-includes/html-api/class-wp-html-processor.php        2024-01-24 19:25:00 UTC (rev 57347)
+++ trunk/src/wp-includes/html-api/class-wp-html-processor.php  2024-01-24 23:35:46 UTC (rev 57348)
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -150,17 +150,6 @@
</span><span class="cx" style="display: block; padding: 0 10px">        const MAX_BOOKMARKS = 100;
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">        /**
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-         * Static query for instructing the Tag Processor to visit every token.
-        *
-        * @access private
-        *
-        * @since 6.4.0
-        *
-        * @var array
-        */
-       const VISIT_EVERYTHING = array( 'tag_closers' => 'visit' );
-
-       /**
</del><span class="cx" style="display: block; padding: 0 10px">          * Holds the working state of the parser, including the stack of
</span><span class="cx" style="display: block; padding: 0 10px">         * open elements and the stack of active formatting elements.
</span><span class="cx" style="display: block; padding: 0 10px">         *
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -425,6 +414,30 @@
</span><span class="cx" style="display: block; padding: 0 10px">        }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">        /**
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         * Ensures internal accounting is maintained for HTML semantic rules while
+        * the underlying Tag Processor class is seeking to a bookmark.
+        *
+        * This doesn't currently have a way to represent non-tags and doesn't process
+        * semantic rules for text nodes. For access to the raw tokens consider using
+        * WP_HTML_Tag_Processor instead.
+        *
+        * @since 6.5.0 Added for internal support; do not use.
+        *
+        * @access private
+        *
+        * @return bool
+        */
+       public function next_token() {
+               $found_a_token = parent::next_token();
+
+               if ( '#tag' === $this->get_token_type() ) {
+                       $this->step( self::REPROCESS_CURRENT_NODE );
+               }
+
+               return $found_a_token;
+       }
+
+       /**
</ins><span class="cx" style="display: block; padding: 0 10px">          * Indicates if the currently-matched tag matches the given breadcrumbs.
</span><span class="cx" style="display: block; padding: 0 10px">         *
</span><span class="cx" style="display: block; padding: 0 10px">         * A "*" represents a single tag wildcard, where any tag matches, but not no tags.
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -520,7 +533,9 @@
</span><span class="cx" style="display: block; padding: 0 10px">                                $this->state->stack_of_open_elements->pop();
</span><span class="cx" style="display: block; padding: 0 10px">                        }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                        parent::next_tag( self::VISIT_EVERYTHING );
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                 while ( parent::next_token() && '#tag' !== $this->get_token_type() ) {
+                               continue;
+                       }
</ins><span class="cx" style="display: block; padding: 0 10px">                 }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                // Finish stepping when there are no more tokens in the document.
</span></span></pre></div>
<a id="trunksrcwpincludeshtmlapiclasswphtmltagprocessorphp"></a>
<div class="modfile"><h4 style="background-color: #eee; color: inherit; margin: 1em 0; padding: 1.3em; font-size: 115%">Modified: trunk/src/wp-includes/html-api/class-wp-html-tag-processor.php</h4>
<pre class="diff"><span>
<span class="info" style="display: block; padding: 0 10px; color: #888">--- trunk/src/wp-includes/html-api/class-wp-html-tag-processor.php    2024-01-24 19:25:00 UTC (rev 57347)
+++ trunk/src/wp-includes/html-api/class-wp-html-tag-processor.php      2024-01-24 23:35:46 UTC (rev 57348)
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -247,6 +247,95 @@
</span><span class="cx" style="display: block; padding: 0 10px">  *         }
</span><span class="cx" style="display: block; padding: 0 10px">  *     }
</span><span class="cx" style="display: block; padding: 0 10px">  *
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+ * ## Tokens and finer-grained processing.
+ *
+ * It's possible to scan through every lexical token in the
+ * HTML document using the `next_token()` function. This
+ * alternative form takes no argument and provides no built-in
+ * query syntax.
+ *
+ * Example:
+ *
+ *      $title = '(untitled)';
+ *      $text  = '';
+ *      while ( $processor->next_token() ) {
+ *          switch ( $processor->get_token_name() ) {
+ *              case '#text':
+ *                  $text .= $processor->get_modifiable_text();
+ *                  break;
+ *
+ *              case 'BR':
+ *                  $text .= "\n";
+ *                  break;
+ *
+ *              case 'TITLE':
+ *                  $title = $processor->get_modifiable_text();
+ *                  break;
+ *          }
+ *      }
+ *      return trim( "# {$title}\n\n{$text}" );
+ *
+ * ### Tokens and _modifiable text_.
+ *
+ * #### Special "atomic" HTML elements.
+ *
+ * Not all HTML elements are able to contain other elements inside of them.
+ * For instance, the contents inside a TITLE element are plaintext (except
+ * that character references like &amp; will be decoded). This means that
+ * if the string `<img>` appears inside a TITLE element, then it's not an
+ * image tag, but rather it's text describing an image tag. Likewise, the
+ * contents of a SCRIPT or STYLE element are handled entirely separately in
+ * a browser than the contents of other elements because they represent a
+ * different language than HTML.
+ *
+ * For these elements the Tag Processor treats the entire sequence as one,
+ * from the opening tag, including its contents, through its closing tag.
+ * This means that the it's not possible to match the closing tag for a
+ * SCRIPT element unless it's unexpected; the Tag Processor already matched
+ * it when it found the opening tag.
+ *
+ * The inner contents of these elements are that element's _modifiable text_.
+ *
+ * The special elements are:
+ *  - `SCRIPT` whose contents are treated as raw plaintext but supports a legacy
+ *    style of including Javascript inside of HTML comments to avoid accidentally
+ *    closing the SCRIPT from inside a Javascript string. E.g. `console.log( '</script>' )`.
+ *  - `TITLE` and `TEXTAREA` whose contents are treated as plaintext and then any
+ *    character references are decoded. E.g. `1 &lt; 2 < 3` becomes `1 < 2 < 3`.
+ *  - `IFRAME`, `NOSCRIPT`, `NOEMBED`, `NOFRAME`, `STYLE` whose contents are treated as
+ *    raw plaintext and left as-is. E.g. `1 &lt; 2 < 3` remains `1 &lt; 2 < 3`.
+ *
+ * #### Other tokens with modifiable text.
+ *
+ * There are also non-elements which are void/self-closing in nature and contain
+ * modifiable text that is part of that individual syntax token itself.
+ *
+ *  - `#text` nodes, whose entire token _is_ the modifiable text.
+ *  - HTML comments and tokens that become comments due to some syntax error. The
+ *    text for these tokens is the portion of the comment inside of the syntax.
+ *    E.g. for `<!-- comment -->` the text is `" comment "` (note the spaces are included).
+ *  - `CDATA` sections, whose text is the content inside of the section itself. E.g. for
+ *    `<![CDATA[some content]]>` the text is `"some content"` (with restrictions [1]).
+ *  - "Funky comments," which are a special case of invalid closing tags whose name is
+ *    invalid. The text for these nodes is the text that a browser would transform into
+ *    an HTML comment when parsing. E.g. for `</%post_author>` the text is `%post_author`.
+ *  - `DOCTYPE` declarations like `<DOCTYPE html>` which have no closing tag.
+ *  - XML Processing instruction nodes like `<?wp __( "Like" ); ?>` (with restrictions [2]).
+ *  - The empty end tag `</>` which is ignored in the browser and DOM.
+ *
+ * [1]: There are no CDATA sections in HTML. When encountering `<![CDATA[`, everything
+ *      until the next `>` becomes a bogus HTML comment, meaning there can be no CDATA
+ *      section in an HTML document containing `>`. The Tag Processor will first find
+ *      all valid and bogus HTML comments, and then if the comment _would_ have been a
+ *      CDATA section _were they to exist_, it will indicate this as the type of comment.
+ *
+ * [2]: XML allows a broader range of characters in a processing instruction's target name
+ *      and disallows "xml" as a name, since it's special. The Tag Processor only recognizes
+ *      target names with an ASCII-representable subset of characters. It also exhibits the
+ *      same constraint as with CDATA sections, in that `>` cannot exist within the token
+ *      since Processing Instructions do no exist within HTML and their syntax transforms
+ *      into a bogus comment in the DOM.
+ *
</ins><span class="cx" style="display: block; padding: 0 10px">  * ## Design and limitations
</span><span class="cx" style="display: block; padding: 0 10px">  *
</span><span class="cx" style="display: block; padding: 0 10px">  * The Tag Processor is designed to linearly scan HTML documents and tokenize
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -320,7 +409,8 @@
</span><span class="cx" style="display: block; padding: 0 10px">  * @since 6.2.1 Fix: Support for various invalid comments; attribute updates are case-insensitive.
</span><span class="cx" style="display: block; padding: 0 10px">  * @since 6.3.2 Fix: Skip HTML-like content inside rawtext elements such as STYLE.
</span><span class="cx" style="display: block; padding: 0 10px">  * @since 6.5.0 Pauses processor when input ends in an incomplete syntax token.
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">- *              Introduces "special" elements which act like void elements, e.g. STYLE.
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+ *              Introduces "special" elements which act like void elements, e.g. TITLE, STYLE.
+ *              Allows scanning through all tokens and processing modifiable text, where applicable.
</ins><span class="cx" style="display: block; padding: 0 10px">  */
</span><span class="cx" style="display: block; padding: 0 10px"> class WP_HTML_Tag_Processor {
</span><span class="cx" style="display: block; padding: 0 10px">        /**
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -396,25 +486,49 @@
</span><span class="cx" style="display: block; padding: 0 10px">        /**
</span><span class="cx" style="display: block; padding: 0 10px">         * Specifies mode of operation of the parser at any given time.
</span><span class="cx" style="display: block; padding: 0 10px">         *
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-         * | State         | Meaning                                                              |
-        * | --------------|----------------------------------------------------------------------|
-        * | *Ready*       | The parser is ready to run.                                          |
-        * | *Complete*    | There is nothing left to parse.                                      |
-        * | *Incomplete*  | The HTML ended in the middle of a token; nothing more can be parsed. |
-        * | *Matched tag* | Found an HTML tag; it's possible to modify its attributes.           |
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+  * | State           | Meaning                                                              |
+        * | ----------------|----------------------------------------------------------------------|
+        * | *Ready*         | The parser is ready to run.                                          |
+        * | *Complete*      | There is nothing left to parse.                                      |
+        * | *Incomplete*    | The HTML ended in the middle of a token; nothing more can be parsed. |
+        * | *Matched tag*   | Found an HTML tag; it's possible to modify its attributes.           |
+        * | *Text node*     | Found a #text node; this is plaintext and modifiable.                |
+        * | *CDATA node*    | Found a CDATA section; this is modifiable.                           |
+        * | *Comment*       | Found a comment or bogus comment; this is modifiable.                |
+        * | *Presumptuous*  | Found an empty tag closer: `</>`.                                    |
+        * | *Funky comment* | Found a tag closer with an invalid tag name; this is modifiable.     |
</ins><span class="cx" style="display: block; padding: 0 10px">          *
</span><span class="cx" style="display: block; padding: 0 10px">         * @since 6.5.0
</span><span class="cx" style="display: block; padding: 0 10px">         *
</span><span class="cx" style="display: block; padding: 0 10px">         * @see WP_HTML_Tag_Processor::STATE_READY
</span><span class="cx" style="display: block; padding: 0 10px">         * @see WP_HTML_Tag_Processor::STATE_COMPLETE
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-         * @see WP_HTML_Tag_Processor::STATE_INCOMPLETE
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+  * @see WP_HTML_Tag_Processor::STATE_INCOMPLETE_INPUT
</ins><span class="cx" style="display: block; padding: 0 10px">          * @see WP_HTML_Tag_Processor::STATE_MATCHED_TAG
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         * @see WP_HTML_Tag_Processor::STATE_TEXT_NODE
+        * @see WP_HTML_Tag_Processor::STATE_CDATA_NODE
+        * @see WP_HTML_Tag_Processor::STATE_COMMENT
+        * @see WP_HTML_Tag_Processor::STATE_DOCTYPE
+        * @see WP_HTML_Tag_Processor::STATE_PRESUMPTUOUS_TAG
+        * @see WP_HTML_Tag_Processor::STATE_FUNKY_COMMENT
</ins><span class="cx" style="display: block; padding: 0 10px">          *
</span><span class="cx" style="display: block; padding: 0 10px">         * @var string
</span><span class="cx" style="display: block; padding: 0 10px">         */
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-        private $parser_state = self::STATE_READY;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+ protected $parser_state = self::STATE_READY;
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">        /**
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         * What kind of syntax token became an HTML comment.
+        *
+        * Since there are many ways in which HTML syntax can create an HTML comment,
+        * this indicates which of those caused it. This allows the Tag Processor to
+        * represent more from the original input document than would appear in the DOM.
+        *
+        * @since 6.5.0
+        *
+        * @var string|null
+        */
+       protected $comment_type = null;
+
+       /**
</ins><span class="cx" style="display: block; padding: 0 10px">          * How many bytes from the original HTML document have been read and parsed.
</span><span class="cx" style="display: block; padding: 0 10px">         *
</span><span class="cx" style="display: block; padding: 0 10px">         * This value points to the latest byte offset in the input document which
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -491,6 +605,24 @@
</span><span class="cx" style="display: block; padding: 0 10px">        private $tag_name_length;
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">        /**
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         * Byte offset into input document where current modifiable text starts.
+        *
+        * @since 6.5.0
+        *
+        * @var int
+        */
+       private $text_starts_at;
+
+       /**
+        * Byte length of modifiable text.
+        *
+        * @since 6.5.0
+        *
+        * @var string
+        */
+       private $text_length;
+
+       /**
</ins><span class="cx" style="display: block; padding: 0 10px">          * Whether the current tag is an opening tag, e.g. <div>, or a closing tag, e.g. </div>.
</span><span class="cx" style="display: block; padding: 0 10px">         *
</span><span class="cx" style="display: block; padding: 0 10px">         * @var bool
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -705,13 +837,13 @@
</span><span class="cx" style="display: block; padding: 0 10px">         * @return bool Whether a token was parsed.
</span><span class="cx" style="display: block; padding: 0 10px">         */
</span><span class="cx" style="display: block; padding: 0 10px">        public function next_token() {
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                $was_at = $this->bytes_already_parsed;
</ins><span class="cx" style="display: block; padding: 0 10px">                 $this->get_updated_html();
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                $was_at = $this->bytes_already_parsed;
</del><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                // Don't proceed if there's nothing more to scan.
</span><span class="cx" style="display: block; padding: 0 10px">                if (
</span><span class="cx" style="display: block; padding: 0 10px">                        self::STATE_COMPLETE === $this->parser_state ||
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                        self::STATE_INCOMPLETE === $this->parser_state
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                 self::STATE_INCOMPLETE_INPUT === $this->parser_state
</ins><span class="cx" style="display: block; padding: 0 10px">                 ) {
</span><span class="cx" style="display: block; padding: 0 10px">                        return false;
</span><span class="cx" style="display: block; padding: 0 10px">                }
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -729,7 +861,7 @@
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                // Find the next tag if it exists.
</span><span class="cx" style="display: block; padding: 0 10px">                if ( false === $this->parse_next_tag() ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                        if ( self::STATE_INCOMPLETE === $this->parser_state ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                 if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                                 $this->bytes_already_parsed = $was_at;
</span><span class="cx" style="display: block; padding: 0 10px">                        }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -736,6 +868,20 @@
</span><span class="cx" style="display: block; padding: 0 10px">                        return false;
</span><span class="cx" style="display: block; padding: 0 10px">                }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                /*
+                * For legacy reasons the rest of this function handles tags and their
+                * attributes. If the processor has reached the end of the document
+                * or if it matched any other token then it should return here to avoid
+                * attempting to process tag-specific syntax.
+                */
+               if (
+                       self::STATE_INCOMPLETE_INPUT !== $this->parser_state &&
+                       self::STATE_COMPLETE !== $this->parser_state &&
+                       self::STATE_MATCHED_TAG !== $this->parser_state
+               ) {
+                       return true;
+               }
+
</ins><span class="cx" style="display: block; padding: 0 10px">                 // Parse all of its attributes.
</span><span class="cx" style="display: block; padding: 0 10px">                while ( $this->parse_next_attribute() ) {
</span><span class="cx" style="display: block; padding: 0 10px">                        continue;
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -743,11 +889,11 @@
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                // Ensure that the tag closes before the end of the document.
</span><span class="cx" style="display: block; padding: 0 10px">                if (
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                        self::STATE_INCOMPLETE === $this->parser_state ||
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                 self::STATE_INCOMPLETE_INPUT === $this->parser_state ||
</ins><span class="cx" style="display: block; padding: 0 10px">                         $this->bytes_already_parsed >= strlen( $this->html )
</span><span class="cx" style="display: block; padding: 0 10px">                ) {
</span><span class="cx" style="display: block; padding: 0 10px">                        // Does this appropriately clear state (parsed attributes)?
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                        $this->parser_state         = self::STATE_INCOMPLETE;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                 $this->parser_state         = self::STATE_INCOMPLETE_INPUT;
</ins><span class="cx" style="display: block; padding: 0 10px">                         $this->bytes_already_parsed = $was_at;
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                        return false;
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -755,7 +901,7 @@
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                $tag_ends_at = strpos( $this->html, '>', $this->bytes_already_parsed );
</span><span class="cx" style="display: block; padding: 0 10px">                if ( false === $tag_ends_at ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                        $this->parser_state         = self::STATE_INCOMPLETE;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                 $this->parser_state         = self::STATE_INCOMPLETE_INPUT;
</ins><span class="cx" style="display: block; padding: 0 10px">                         $this->bytes_already_parsed = $was_at;
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                        return false;
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -762,7 +908,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                }
</span><span class="cx" style="display: block; padding: 0 10px">                $this->parser_state         = self::STATE_MATCHED_TAG;
</span><span class="cx" style="display: block; padding: 0 10px">                $this->token_length         = $tag_ends_at - $this->token_starts_at;
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                $this->bytes_already_parsed = $tag_ends_at;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         $this->bytes_already_parsed = $tag_ends_at + 1;
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                /*
</span><span class="cx" style="display: block; padding: 0 10px">                 * For non-DATA sections which might contain text that looks like HTML tags but
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -771,8 +917,8 @@
</span><span class="cx" style="display: block; padding: 0 10px">                 */
</span><span class="cx" style="display: block; padding: 0 10px">                $t = $this->html[ $this->tag_name_starts_at ];
</span><span class="cx" style="display: block; padding: 0 10px">                if (
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                        ! $this->is_closing_tag &&
-                       (
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                 $this->is_closing_tag ||
+                       ! (
</ins><span class="cx" style="display: block; padding: 0 10px">                                 'i' === $t || 'I' === $t ||
</span><span class="cx" style="display: block; padding: 0 10px">                                'n' === $t || 'N' === $t ||
</span><span class="cx" style="display: block; padding: 0 10px">                                's' === $t || 'S' === $t ||
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -780,38 +926,81 @@
</span><span class="cx" style="display: block; padding: 0 10px">                                'x' === $t || 'X' === $t
</span><span class="cx" style="display: block; padding: 0 10px">                        )
</span><span class="cx" style="display: block; padding: 0 10px">                ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                        $tag_name = $this->get_tag();
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                 return true;
+               }
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                        if ( 'SCRIPT' === $tag_name && ! $this->skip_script_data() ) {
-                               $this->parser_state         = self::STATE_INCOMPLETE;
-                               $this->bytes_already_parsed = $was_at;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         $tag_name = $this->get_tag();
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                return false;
-                       } elseif (
-                               ( 'TEXTAREA' === $tag_name || 'TITLE' === $tag_name ) &&
-                               ! $this->skip_rcdata( $tag_name )
-                       ) {
-                               $this->parser_state         = self::STATE_INCOMPLETE;
-                               $this->bytes_already_parsed = $was_at;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         /*
+                * Preserve the opening tag pointers, as these will be overwritten
+                * when finding the closing tag. They will be reset after finding
+                * the closing tag to point to the opening of the special atomic
+                * tag sequence.
+                */
+               $tag_name_starts_at   = $this->tag_name_starts_at;
+               $tag_name_length      = $this->tag_name_length;
+               $tag_ends_at          = $this->token_starts_at + $this->token_length;
+               $attributes           = $this->attributes;
+               $duplicate_attributes = $this->duplicate_attributes;
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                return false;
-                       } elseif (
-                               (
-                                       'IFRAME' === $tag_name ||
-                                       'NOEMBED' === $tag_name ||
-                                       'NOFRAMES' === $tag_name ||
-                                       'STYLE' === $tag_name ||
-                                       'XMP' === $tag_name
-                               ) &&
-                               ! $this->skip_rawtext( $tag_name )
-                       ) {
-                               $this->parser_state         = self::STATE_INCOMPLETE;
-                               $this->bytes_already_parsed = $was_at;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         // Find the closing tag if necessary.
+               $found_closer = false;
+               switch ( $tag_name ) {
+                       case 'SCRIPT':
+                               $found_closer = $this->skip_script_data();
+                               break;
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                return false;
-                       }
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                 case 'TEXTAREA':
+                       case 'TITLE':
+                               $found_closer = $this->skip_rcdata( $tag_name );
+                               break;
+
+                       /*
+                        * In the browser this list would include the NOSCRIPT element,
+                        * but the Tag Processor is an environment with the scripting
+                        * flag disabled, meaning that it needs to descend into the
+                        * NOSCRIPT element to be able to properly process what will be
+                        * sent to a browser.
+                        *
+                        * Note that this rule makes HTML5 syntax incompatible with XML,
+                        * because the parsing of this token depends on client application.
+                        * The NOSCRIPT element cannot be represented in the XHTML syntax.
+                        */
+                       case 'IFRAME':
+                       case 'NOEMBED':
+                       case 'NOFRAMES':
+                       case 'STYLE':
+                       case 'XMP':
+                               $found_closer = $this->skip_rawtext( $tag_name );
+                               break;
+
+                       // No other tags should be treated in their entirety here.
+                       default:
+                               return true;
</ins><span class="cx" style="display: block; padding: 0 10px">                 }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                if ( ! $found_closer ) {
+                       $this->parser_state         = self::STATE_INCOMPLETE_INPUT;
+                       $this->bytes_already_parsed = $was_at;
+                       return false;
+               }
+
+               /*
+                * The values here look like they reference the opening tag but they reference
+                * the closing tag instead. This is why the opening tag values were stored
+                * above in a variable. It reads confusingly here, but that's because the
+                * functions that skip the contents have moved all the internal cursors past
+                * the inner content of the tag.
+                */
+               $this->token_starts_at      = $was_at;
+               $this->token_length         = $this->bytes_already_parsed - $this->token_starts_at;
+               $this->text_starts_at       = $tag_ends_at + 1;
+               $this->text_length          = $this->tag_name_starts_at - $this->text_starts_at;
+               $this->tag_name_starts_at   = $tag_name_starts_at;
+               $this->tag_name_length      = $tag_name_length;
+               $this->attributes           = $attributes;
+               $this->duplicate_attributes = $duplicate_attributes;
+
</ins><span class="cx" style="display: block; padding: 0 10px">                 return true;
</span><span class="cx" style="display: block; padding: 0 10px">        }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -830,7 +1019,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">         * @return bool Whether the parse paused at the start of an incomplete token.
</span><span class="cx" style="display: block; padding: 0 10px">         */
</span><span class="cx" style="display: block; padding: 0 10px">        public function paused_at_incomplete_token() {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                return self::STATE_INCOMPLETE === $this->parser_state;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         return self::STATE_INCOMPLETE_INPUT === $this->parser_state;
</ins><span class="cx" style="display: block; padding: 0 10px">         }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">        /**
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1007,7 +1196,10 @@
</span><span class="cx" style="display: block; padding: 0 10px">         */
</span><span class="cx" style="display: block; padding: 0 10px">        public function set_bookmark( $name ) {
</span><span class="cx" style="display: block; padding: 0 10px">                // It only makes sense to set a bookmark if the parser has paused on a concrete token.
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         if (
+                       self::STATE_COMPLETE === $this->parser_state ||
+                       self::STATE_INCOMPLETE_INPUT === $this->parser_state
+               ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                         return false;
</span><span class="cx" style="display: block; padding: 0 10px">                }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1082,7 +1274,8 @@
</span><span class="cx" style="display: block; padding: 0 10px">                $at = $this->bytes_already_parsed;
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                while ( false !== $at && $at < $doc_length ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                        $at = strpos( $this->html, '</', $at );
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                 $at                       = strpos( $this->html, '</', $at );
+                       $this->tag_name_starts_at = $at;
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                        // Fail if there is no possible tag closer.
</span><span class="cx" style="display: block; padding: 0 10px">                        if ( false === $at || ( $at + $tag_length ) >= $doc_length ) {
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1089,8 +1282,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                                return false;
</span><span class="cx" style="display: block; padding: 0 10px">                        }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                        $closer_potentially_starts_at = $at;
-                       $at                          += 2;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                 $at += 2;
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                        /*
</span><span class="cx" style="display: block; padding: 0 10px">                         * Find a case-insensitive match to the tag name.
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1131,15 +1323,25 @@
</span><span class="cx" style="display: block; padding: 0 10px">                        while ( $this->parse_next_attribute() ) {
</span><span class="cx" style="display: block; padding: 0 10px">                                continue;
</span><span class="cx" style="display: block; padding: 0 10px">                        }
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+
</ins><span class="cx" style="display: block; padding: 0 10px">                         $at = $this->bytes_already_parsed;
</span><span class="cx" style="display: block; padding: 0 10px">                        if ( $at >= strlen( $this->html ) ) {
</span><span class="cx" style="display: block; padding: 0 10px">                                return false;
</span><span class="cx" style="display: block; padding: 0 10px">                        }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                        if ( '>' === $html[ $at ] || '/' === $html[ $at ] ) {
-                               $this->bytes_already_parsed = $closer_potentially_starts_at;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                 if ( '>' === $html[ $at ] ) {
+                               $this->bytes_already_parsed = $at + 1;
</ins><span class="cx" style="display: block; padding: 0 10px">                                 return true;
</span><span class="cx" style="display: block; padding: 0 10px">                        }
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+
+                       if ( $at + 1 >= strlen( $this->html ) ) {
+                               return false;
+                       }
+
+                       if ( '/' === $html[ $at ] && '>' === $html[ $at + 1 ] ) {
+                               $this->bytes_already_parsed = $at + 2;
+                               return true;
+                       }
</ins><span class="cx" style="display: block; padding: 0 10px">                 }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                return false;
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1259,6 +1461,7 @@
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                        if ( $is_closing ) {
</span><span class="cx" style="display: block; padding: 0 10px">                                $this->bytes_already_parsed = $closer_potentially_starts_at;
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                                $this->tag_name_starts_at   = $closer_potentially_starts_at;
</ins><span class="cx" style="display: block; padding: 0 10px">                                 if ( $this->bytes_already_parsed >= $doc_length ) {
</span><span class="cx" style="display: block; padding: 0 10px">                                        return false;
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1268,13 +1471,13 @@
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                                if ( $this->bytes_already_parsed >= $doc_length ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                        $this->parser_state = self::STATE_INCOMPLETE;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                                 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                                        return false;
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                                if ( '>' === $html[ $this->bytes_already_parsed ] ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                        $this->bytes_already_parsed = $closer_potentially_starts_at;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                                 ++$this->bytes_already_parsed;
</ins><span class="cx" style="display: block; padding: 0 10px">                                         return true;
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="cx" style="display: block; padding: 0 10px">                        }
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1303,17 +1506,34 @@
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                $html       = $this->html;
</span><span class="cx" style="display: block; padding: 0 10px">                $doc_length = strlen( $html );
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                $at         = $this->bytes_already_parsed;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         $was_at     = $this->bytes_already_parsed;
+               $at         = $was_at;
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                while ( false !== $at && $at < $doc_length ) {
</span><span class="cx" style="display: block; padding: 0 10px">                        $at = strpos( $html, '<', $at );
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                        if ( $at > $was_at ) {
+                               $this->parser_state         = self::STATE_TEXT_NODE;
+                               $this->token_starts_at      = $was_at;
+                               $this->token_length         = $at - $was_at;
+                               $this->text_starts_at       = $was_at;
+                               $this->text_length          = $this->token_length;
+                               $this->bytes_already_parsed = $at;
+                               return true;
+                       }
+
</ins><span class="cx" style="display: block; padding: 0 10px">                         /*
</span><span class="cx" style="display: block; padding: 0 10px">                         * This does not imply an incomplete parse; it indicates that there
</span><span class="cx" style="display: block; padding: 0 10px">                         * can be nothing left in the document other than a #text node.
</span><span class="cx" style="display: block; padding: 0 10px">                         */
</span><span class="cx" style="display: block; padding: 0 10px">                        if ( false === $at ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                return false;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         $this->parser_state         = self::STATE_TEXT_NODE;
+                               $this->token_starts_at      = $was_at;
+                               $this->token_length         = strlen( $html ) - $was_at;
+                               $this->text_starts_at       = $was_at;
+                               $this->text_length          = $this->token_length;
+                               $this->bytes_already_parsed = strlen( $html );
+                               return true;
</ins><span class="cx" style="display: block; padding: 0 10px">                         }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                        $this->token_starts_at = $at;
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1342,8 +1562,9 @@
</span><span class="cx" style="display: block; padding: 0 10px">                        $tag_name_prefix_length = strspn( $html, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', $at + 1 );
</span><span class="cx" style="display: block; padding: 0 10px">                        if ( $tag_name_prefix_length > 0 ) {
</span><span class="cx" style="display: block; padding: 0 10px">                                ++$at;
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                                $this->parser_state         = self::STATE_MATCHED_TAG;
+                               $this->tag_name_starts_at   = $at;
</ins><span class="cx" style="display: block; padding: 0 10px">                                 $this->tag_name_length      = $tag_name_prefix_length + strcspn( $html, " \t\f\r\n/>", $at + $tag_name_prefix_length );
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $this->tag_name_starts_at   = $at;
</del><span class="cx" style="display: block; padding: 0 10px">                                 $this->bytes_already_parsed = $at + $this->tag_name_length;
</span><span class="cx" style="display: block; padding: 0 10px">                                return true;
</span><span class="cx" style="display: block; padding: 0 10px">                        }
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1353,18 +1574,18 @@
</span><span class="cx" style="display: block; padding: 0 10px">                         * the document. There is nothing left to parse.
</span><span class="cx" style="display: block; padding: 0 10px">                         */
</span><span class="cx" style="display: block; padding: 0 10px">                        if ( $at + 1 >= $doc_length ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $this->parser_state = self::STATE_INCOMPLETE;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         $this->parser_state = self::STATE_INCOMPLETE_INPUT;
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                                return false;
</span><span class="cx" style="display: block; padding: 0 10px">                        }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                        /*
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                         * <! transitions to markup declaration open state
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                  * `<!` transitions to markup declaration open state
</ins><span class="cx" style="display: block; padding: 0 10px">                          * https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state
</span><span class="cx" style="display: block; padding: 0 10px">                         */
</span><span class="cx" style="display: block; padding: 0 10px">                        if ( '!' === $html[ $at + 1 ] ) {
</span><span class="cx" style="display: block; padding: 0 10px">                                /*
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                 * <!-- transitions to a bogus comment state – skip to the nearest -->
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                          * `<!--` transitions to a comment state – apply further comment rules.
</ins><span class="cx" style="display: block; padding: 0 10px">                                  * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
</span><span class="cx" style="display: block; padding: 0 10px">                                 */
</span><span class="cx" style="display: block; padding: 0 10px">                                if (
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1375,7 +1596,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                                        $closer_at = $at + 4;
</span><span class="cx" style="display: block; padding: 0 10px">                                        // If it's not possible to close the comment then there is nothing more to scan.
</span><span class="cx" style="display: block; padding: 0 10px">                                        if ( $doc_length <= $closer_at ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                                $this->parser_state = self::STATE_INCOMPLETE;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                                         $this->parser_state = self::STATE_INCOMPLETE_INPUT;
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                                                return false;
</span><span class="cx" style="display: block; padding: 0 10px">                                        }
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1383,8 +1604,27 @@
</span><span class="cx" style="display: block; padding: 0 10px">                                        // Abruptly-closed empty comments are a sequence of dashes followed by `>`.
</span><span class="cx" style="display: block; padding: 0 10px">                                        $span_of_dashes = strspn( $html, '-', $closer_at );
</span><span class="cx" style="display: block; padding: 0 10px">                                        if ( '>' === $html[ $closer_at + $span_of_dashes ] ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                                $at = $closer_at + $span_of_dashes + 1;
-                                               continue;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                                         /*
+                                                * @todo When implementing `set_modifiable_text()` ensure that updates to this token
+                                                *       don't break the syntax for short comments, e.g. `<!--->`. Unlike other comment
+                                                *       and bogus comment syntax, these leave no clear insertion point for text and
+                                                *       they need to be modified specially in order to contain text. E.g. to store
+                                                *       `?` as the modifiable text, the `<!--->` needs to become `<!--?-->`, which
+                                                *       involves inserting an additional `-` into the token after the modifiable text.
+                                                */
+                                               $this->parser_state = self::STATE_COMMENT;
+                                               $this->comment_type = self::COMMENT_AS_ABRUPTLY_CLOSED_COMMENT;
+                                               $this->token_length = $closer_at + $span_of_dashes + 1 - $this->token_starts_at;
+
+                                               // Only provide modifiable text if the token is long enough to contain it.
+                                               if ( $span_of_dashes >= 2 ) {
+                                                       $this->comment_type   = self::COMMENT_AS_HTML_COMMENT;
+                                                       $this->text_starts_at = $this->token_starts_at + 4;
+                                                       $this->text_length    = $span_of_dashes - 2;
+                                               }
+
+                                               $this->bytes_already_parsed = $closer_at + $span_of_dashes + 1;
+                                               return true;
</ins><span class="cx" style="display: block; padding: 0 10px">                                         }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                                        /*
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1397,51 +1637,39 @@
</span><span class="cx" style="display: block; padding: 0 10px">                                        while ( ++$closer_at < $doc_length ) {
</span><span class="cx" style="display: block; padding: 0 10px">                                                $closer_at = strpos( $html, '--', $closer_at );
</span><span class="cx" style="display: block; padding: 0 10px">                                                if ( false === $closer_at ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                                        $this->parser_state = self::STATE_INCOMPLETE;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                                                 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                                                        return false;
</span><span class="cx" style="display: block; padding: 0 10px">                                                }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                                                if ( $closer_at + 2 < $doc_length && '>' === $html[ $closer_at + 2 ] ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                                        $at = $closer_at + 3;
-                                                       continue 2;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                                                 $this->parser_state         = self::STATE_COMMENT;
+                                                       $this->comment_type         = self::COMMENT_AS_HTML_COMMENT;
+                                                       $this->token_length         = $closer_at + 3 - $this->token_starts_at;
+                                                       $this->text_starts_at       = $this->token_starts_at + 4;
+                                                       $this->text_length          = $closer_at - $this->text_starts_at;
+                                                       $this->bytes_already_parsed = $closer_at + 3;
+                                                       return true;
</ins><span class="cx" style="display: block; padding: 0 10px">                                                 }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                                if ( $closer_at + 3 < $doc_length && '!' === $html[ $closer_at + 2 ] && '>' === $html[ $closer_at + 3 ] ) {
-                                                       $at = $closer_at + 4;
-                                                       continue 2;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                                         if (
+                                                       $closer_at + 3 < $doc_length &&
+                                                       '!' === $html[ $closer_at + 2 ] &&
+                                                       '>' === $html[ $closer_at + 3 ]
+                                               ) {
+                                                       $this->parser_state         = self::STATE_COMMENT;
+                                                       $this->comment_type         = self::COMMENT_AS_HTML_COMMENT;
+                                                       $this->token_length         = $closer_at + 4 - $this->token_starts_at;
+                                                       $this->text_starts_at       = $this->token_starts_at + 4;
+                                                       $this->text_length          = $closer_at - $this->text_starts_at;
+                                                       $this->bytes_already_parsed = $closer_at + 4;
+                                                       return true;
</ins><span class="cx" style="display: block; padding: 0 10px">                                                 }
</span><span class="cx" style="display: block; padding: 0 10px">                                        }
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                                /*
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                 * <![CDATA[ transitions to CDATA section state – skip to the nearest ]]>
-                                * The CDATA is case-sensitive.
-                                * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
-                                */
-                               if (
-                                       $doc_length > $at + 8 &&
-                                       '[' === $html[ $at + 2 ] &&
-                                       'C' === $html[ $at + 3 ] &&
-                                       'D' === $html[ $at + 4 ] &&
-                                       'A' === $html[ $at + 5 ] &&
-                                       'T' === $html[ $at + 6 ] &&
-                                       'A' === $html[ $at + 7 ] &&
-                                       '[' === $html[ $at + 8 ]
-                               ) {
-                                       $closer_at = strpos( $html, ']]>', $at + 9 );
-                                       if ( false === $closer_at ) {
-                                               $this->parser_state = self::STATE_INCOMPLETE;
-
-                                               return false;
-                                       }
-
-                                       $at = $closer_at + 3;
-                                       continue;
-                               }
-
-                               /*
-                                * <!DOCTYPE transitions to DOCTYPE state – skip to the nearest >
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                          * `<!DOCTYPE` transitions to DOCTYPE state – skip to the nearest >
</ins><span class="cx" style="display: block; padding: 0 10px">                                  * These are ASCII-case-insensitive.
</span><span class="cx" style="display: block; padding: 0 10px">                                 * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
</span><span class="cx" style="display: block; padding: 0 10px">                                 */
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1457,13 +1685,17 @@
</span><span class="cx" style="display: block; padding: 0 10px">                                ) {
</span><span class="cx" style="display: block; padding: 0 10px">                                        $closer_at = strpos( $html, '>', $at + 9 );
</span><span class="cx" style="display: block; padding: 0 10px">                                        if ( false === $closer_at ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                                $this->parser_state = self::STATE_INCOMPLETE;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                                         $this->parser_state = self::STATE_INCOMPLETE_INPUT;
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                                                return false;
</span><span class="cx" style="display: block; padding: 0 10px">                                        }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                        $at = $closer_at + 1;
-                                       continue;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                                 $this->parser_state         = self::STATE_DOCTYPE;
+                                       $this->token_length         = $closer_at + 1 - $this->token_starts_at;
+                                       $this->text_starts_at       = $this->token_starts_at + 9;
+                                       $this->text_length          = $closer_at - $this->text_starts_at;
+                                       $this->bytes_already_parsed = $closer_at + 1;
+                                       return true;
</ins><span class="cx" style="display: block; padding: 0 10px">                                 }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                                /*
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1471,14 +1703,53 @@
</span><span class="cx" style="display: block; padding: 0 10px">                                 * to the bogus comment state - skip to the nearest >. If no closer is
</span><span class="cx" style="display: block; padding: 0 10px">                                 * found then the HTML was truncated inside the markup declaration.
</span><span class="cx" style="display: block; padding: 0 10px">                                 */
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $at = strpos( $html, '>', $at + 1 );
-                               if ( false === $at ) {
-                                       $this->parser_state = self::STATE_INCOMPLETE;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         $closer_at = strpos( $html, '>', $at + 1 );
+                               if ( false === $closer_at ) {
+                                       $this->parser_state = self::STATE_INCOMPLETE_INPUT;
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                                        return false;
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                continue;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         $this->parser_state         = self::STATE_COMMENT;
+                               $this->comment_type         = self::COMMENT_AS_INVALID_HTML;
+                               $this->token_length         = $closer_at + 1 - $this->token_starts_at;
+                               $this->text_starts_at       = $this->token_starts_at + 2;
+                               $this->text_length          = $closer_at - $this->text_starts_at;
+                               $this->bytes_already_parsed = $closer_at + 1;
+
+                               /*
+                                * Identify nodes that would be CDATA if HTML had CDATA sections.
+                                *
+                                * This section must occur after identifying the bogus comment end
+                                * because in an HTML parser it will span to the nearest `>`, even
+                                * if there's no `]]>` as would be required in an XML document. It
+                                * is therefore not possible to parse a CDATA section containing
+                                * a `>` in the HTML syntax.
+                                *
+                                * Inside foreign elements there is a discrepancy between browsers
+                                * and the specification on this.
+                                *
+                                * @todo Track whether the Tag Processor is inside a foreign element
+                                *       and require the proper closing `]]>` in those cases.
+                                */
+                               if (
+                                       $this->token_length >= 10 &&
+                                       '[' === $html[ $this->token_starts_at + 2 ] &&
+                                       'C' === $html[ $this->token_starts_at + 3 ] &&
+                                       'D' === $html[ $this->token_starts_at + 4 ] &&
+                                       'A' === $html[ $this->token_starts_at + 5 ] &&
+                                       'T' === $html[ $this->token_starts_at + 6 ] &&
+                                       'A' === $html[ $this->token_starts_at + 7 ] &&
+                                       '[' === $html[ $this->token_starts_at + 8 ] &&
+                                       ']' === $html[ $closer_at - 1 ]
+                               ) {
+                                       $this->parser_state    = self::STATE_COMMENT;
+                                       $this->comment_type    = self::COMMENT_AS_CDATA_LOOKALIKE;
+                                       $this->text_starts_at += 7;
+                                       $this->text_length    -= 9;
+                               }
+
+                               return true;
</ins><span class="cx" style="display: block; padding: 0 10px">                         }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                        /*
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1491,24 +1762,71 @@
</span><span class="cx" style="display: block; padding: 0 10px">                         * See https://html.spec.whatwg.org/#parse-error-missing-end-tag-name
</span><span class="cx" style="display: block; padding: 0 10px">                         */
</span><span class="cx" style="display: block; padding: 0 10px">                        if ( '>' === $html[ $at + 1 ] ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                ++$at;
-                               continue;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         $this->parser_state         = self::STATE_PRESUMPTUOUS_TAG;
+                               $this->token_length         = $at + 2 - $this->token_starts_at;
+                               $this->bytes_already_parsed = $at + 2;
+                               return true;
</ins><span class="cx" style="display: block; padding: 0 10px">                         }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                        /*
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                         * <? transitions to a bogus comment state – skip to the nearest >
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                  * `<?` transitions to a bogus comment state – skip to the nearest >
</ins><span class="cx" style="display: block; padding: 0 10px">                          * See https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
</span><span class="cx" style="display: block; padding: 0 10px">                         */
</span><span class="cx" style="display: block; padding: 0 10px">                        if ( '?' === $html[ $at + 1 ] ) {
</span><span class="cx" style="display: block; padding: 0 10px">                                $closer_at = strpos( $html, '>', $at + 2 );
</span><span class="cx" style="display: block; padding: 0 10px">                                if ( false === $closer_at ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                        $this->parser_state = self::STATE_INCOMPLETE;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                                 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                                        return false;
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $at = $closer_at + 1;
-                               continue;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         $this->parser_state         = self::STATE_COMMENT;
+                               $this->comment_type         = self::COMMENT_AS_INVALID_HTML;
+                               $this->token_length         = $closer_at + 1 - $this->token_starts_at;
+                               $this->text_starts_at       = $this->token_starts_at + 2;
+                               $this->text_length          = $closer_at - $this->text_starts_at;
+                               $this->bytes_already_parsed = $closer_at + 1;
+
+                               /*
+                                * Identify a Processing Instruction node were HTML to have them.
+                                *
+                                * This section must occur after identifying the bogus comment end
+                                * because in an HTML parser it will span to the nearest `>`, even
+                                * if there's no `?>` as would be required in an XML document. It
+                                * is therefore not possible to parse a Processing Instruction node
+                                * containing a `>` in the HTML syntax.
+                                *
+                                * XML allows for more target names, but this code only identifies
+                                * those with ASCII-representable target names. This means that it
+                                * may identify some Processing Instruction nodes as bogus comments,
+                                * but it will not misinterpret the HTML structure. By limiting the
+                                * identification to these target names the Tag Processor can avoid
+                                * the need to start parsing UTF-8 sequences.
+                                *
+                                * > NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] |
+                                *                     [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] |
+                                *                     [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] |
+                                *                     [#x10000-#xEFFFF]
+                                * > NameChar      ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
+                                *
+                                * @see https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PITarget
+                                */
+                               if ( $this->token_length >= 5 && '?' === $html[ $closer_at - 1 ] ) {
+                                       $comment_text     = substr( $html, $this->token_starts_at + 2, $this->token_length - 4 );
+                                       $pi_target_length = strspn( $comment_text, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:_' );
+
+                                       if ( 0 < $pi_target_length ) {
+                                               $pi_target_length += strspn( $comment_text, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:_-.', $pi_target_length );
+
+                                               $this->comment_type       = self::COMMENT_AS_PI_NODE_LOOKALIKE;
+                                               $this->tag_name_starts_at = $this->token_starts_at + 2;
+                                               $this->tag_name_length    = $pi_target_length;
+                                               $this->text_starts_at    += $pi_target_length;
+                                               $this->text_length       -= $pi_target_length + 1;
+                                       }
+                               }
+
+                               return true;
</ins><span class="cx" style="display: block; padding: 0 10px">                         }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                        /*
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1515,6 +1833,9 @@
</span><span class="cx" style="display: block; padding: 0 10px">                         * If a non-alpha starts the tag name in a tag closer it's a comment.
</span><span class="cx" style="display: block; padding: 0 10px">                         * Find the first `>`, which closes the comment.
</span><span class="cx" style="display: block; padding: 0 10px">                         *
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         * This parser classifies these particular comments as special "funky comments"
+                        * which are made available for further processing.
+                        *
</ins><span class="cx" style="display: block; padding: 0 10px">                          * See https://html.spec.whatwg.org/#parse-error-invalid-first-character-of-tag-name
</span><span class="cx" style="display: block; padding: 0 10px">                         */
</span><span class="cx" style="display: block; padding: 0 10px">                        if ( $this->is_closing_tag ) {
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1525,13 +1846,17 @@
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                                $closer_at = strpos( $html, '>', $at + 3 );
</span><span class="cx" style="display: block; padding: 0 10px">                                if ( false === $closer_at ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                        $this->parser_state = self::STATE_INCOMPLETE;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                                 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                                        return false;
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $at = $closer_at + 1;
-                               continue;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         $this->parser_state         = self::STATE_FUNKY_COMMENT;
+                               $this->token_length         = $closer_at + 1 - $this->token_starts_at;
+                               $this->text_starts_at       = $this->token_starts_at + 2;
+                               $this->text_length          = $closer_at - $this->text_starts_at;
+                               $this->bytes_already_parsed = $closer_at + 1;
+                               return true;
</ins><span class="cx" style="display: block; padding: 0 10px">                         }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                        ++$at;
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1551,7 +1876,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                // Skip whitespace and slashes.
</span><span class="cx" style="display: block; padding: 0 10px">                $this->bytes_already_parsed += strspn( $this->html, " \t\f\r\n/", $this->bytes_already_parsed );
</span><span class="cx" style="display: block; padding: 0 10px">                if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                        $this->parser_state = self::STATE_INCOMPLETE;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                        return false;
</span><span class="cx" style="display: block; padding: 0 10px">                }
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1575,7 +1900,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                $attribute_name              = substr( $this->html, $attribute_start, $name_length );
</span><span class="cx" style="display: block; padding: 0 10px">                $this->bytes_already_parsed += $name_length;
</span><span class="cx" style="display: block; padding: 0 10px">                if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                        $this->parser_state = self::STATE_INCOMPLETE;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                        return false;
</span><span class="cx" style="display: block; padding: 0 10px">                }
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1582,7 +1907,7 @@
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                $this->skip_whitespace();
</span><span class="cx" style="display: block; padding: 0 10px">                if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                        $this->parser_state = self::STATE_INCOMPLETE;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                        return false;
</span><span class="cx" style="display: block; padding: 0 10px">                }
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1592,7 +1917,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                        ++$this->bytes_already_parsed;
</span><span class="cx" style="display: block; padding: 0 10px">                        $this->skip_whitespace();
</span><span class="cx" style="display: block; padding: 0 10px">                        if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $this->parser_state = self::STATE_INCOMPLETE;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         $this->parser_state = self::STATE_INCOMPLETE_INPUT;
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                                return false;
</span><span class="cx" style="display: block; padding: 0 10px">                        }
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1620,7 +1945,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                if ( $attribute_end >= strlen( $this->html ) ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                        $this->parser_state = self::STATE_INCOMPLETE;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                 $this->parser_state = self::STATE_INCOMPLETE_INPUT;
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                        return false;
</span><span class="cx" style="display: block; padding: 0 10px">                }
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1692,8 +2017,11 @@
</span><span class="cx" style="display: block; padding: 0 10px">                $this->token_length         = null;
</span><span class="cx" style="display: block; padding: 0 10px">                $this->tag_name_starts_at   = null;
</span><span class="cx" style="display: block; padding: 0 10px">                $this->tag_name_length      = null;
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                $this->text_starts_at       = 0;
+               $this->text_length          = 0;
</ins><span class="cx" style="display: block; padding: 0 10px">                 $this->is_closing_tag       = null;
</span><span class="cx" style="display: block; padding: 0 10px">                $this->attributes           = array();
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                $this->comment_type         = null;
</ins><span class="cx" style="display: block; padding: 0 10px">                 $this->duplicate_attributes = null;
</span><span class="cx" style="display: block; padding: 0 10px">        }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1985,7 +2313,7 @@
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                // Point this tag processor before the sought tag opener and consume it.
</span><span class="cx" style="display: block; padding: 0 10px">                $this->bytes_already_parsed = $this->bookmarks[ $bookmark_name ]->start;
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                return $this->next_tag( array( 'tag_closers' => 'visit' ) );
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         return $this->next_token();
</ins><span class="cx" style="display: block; padding: 0 10px">         }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">        /**
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -2216,13 +2544,24 @@
</span><span class="cx" style="display: block; padding: 0 10px">         * @return string|null Name of currently matched tag in input HTML, or `null` if none found.
</span><span class="cx" style="display: block; padding: 0 10px">         */
</span><span class="cx" style="display: block; padding: 0 10px">        public function get_tag() {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         if ( null === $this->tag_name_starts_at ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                         return null;
</span><span class="cx" style="display: block; padding: 0 10px">                }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                $tag_name = substr( $this->html, $this->tag_name_starts_at, $this->tag_name_length );
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                return strtoupper( $tag_name );
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         if ( self::STATE_MATCHED_TAG === $this->parser_state ) {
+                       return strtoupper( $tag_name );
+               }
+
+               if (
+                       self::STATE_COMMENT === $this->parser_state &&
+                       self::COMMENT_AS_PI_NODE_LOOKALIKE === $this->get_comment_type()
+               ) {
+                       return $tag_name;
+               }
+
+               return null;
</ins><span class="cx" style="display: block; padding: 0 10px">         }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">        /**
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -2282,6 +2621,191 @@
</span><span class="cx" style="display: block; padding: 0 10px">        }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">        /**
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         * Indicates the kind of matched token, if any.
+        *
+        * This differs from `get_token_name()` in that it always
+        * returns a static string indicating the type, whereas
+        * `get_token_name()` may return values derived from the
+        * token itself, such as a tag name or processing
+        * instruction tag.
+        *
+        * Possible values:
+        *  - `#tag` when matched on a tag.
+        *  - `#text` when matched on a text node.
+        *  - `#cdata-section` when matched on a CDATA node.
+        *  - `#comment` when matched on a comment.
+        *  - `#doctype` when matched on a DOCTYPE declaration.
+        *  - `#presumptuous-tag` when matched on an empty tag closer.
+        *  - `#funky-comment` when matched on a funky comment.
+        *
+        * @since 6.5.0
+        *
+        * @return string|null What kind of token is matched, or null.
+        */
+       public function get_token_type() {
+               switch ( $this->parser_state ) {
+                       case self::STATE_MATCHED_TAG:
+                               return '#tag';
+
+                       case self::STATE_DOCTYPE:
+                               return '#doctype';
+
+                       default:
+                               return $this->get_token_name();
+               }
+       }
+
+       /**
+        * Returns the node name represented by the token.
+        *
+        * This matches the DOM API value `nodeName`. Some values
+        * are static, such as `#text` for a text node, while others
+        * are dynamically generated from the token itself.
+        *
+        * Dynamic names:
+        *  - Uppercase tag name for tag matches.
+        *  - `html` for DOCTYPE declarations.
+        *
+        * Note that if the Tag Processor is not matched on a token
+        * then this function will return `null`, either because it
+        * hasn't yet found a token or because it reached the end
+        * of the document without matching a token.
+        *
+        * @since 6.5.0
+        *
+        * @return string|null Name of the matched token.
+        */
+       public function get_token_name() {
+               switch ( $this->parser_state ) {
+                       case self::STATE_MATCHED_TAG:
+                               return $this->get_tag();
+
+                       case self::STATE_TEXT_NODE:
+                               return '#text';
+
+                       case self::STATE_CDATA_NODE:
+                               return '#cdata-section';
+
+                       case self::STATE_COMMENT:
+                               return '#comment';
+
+                       case self::STATE_DOCTYPE:
+                               return 'html';
+
+                       case self::STATE_PRESUMPTUOUS_TAG:
+                               return '#presumptuous-tag';
+
+                       case self::STATE_FUNKY_COMMENT:
+                               return '#funky-comment';
+               }
+       }
+
+       /**
+        * Indicates what kind of comment produced the comment node.
+        *
+        * Because there are different kinds of HTML syntax which produce
+        * comments, the Tag Processor tracks and exposes this as a type
+        * for the comment. Nominally only regular HTML comments exist as
+        * they are commonly known, but a number of unrelated syntax errors
+        * also produce comments.
+        *
+        * @see self::COMMENT_AS_ABRUPTLY_CLOSED_COMMENT
+        * @see self::COMMENT_AS_CDATA_LOOKALIKE
+        * @see self::COMMENT_AS_INVALID_HTML
+        * @see self::COMMENT_AS_HTML_COMMENT
+        * @see self::COMMENT_AS_PI_NODE_LOOKALIKE
+        *
+        * @since 6.5.0
+        *
+        * @return string|null
+        */
+       public function get_comment_type() {
+               if ( self::STATE_COMMENT !== $this->parser_state ) {
+                       return null;
+               }
+
+               return $this->comment_type;
+       }
+
+       /**
+        * Returns the modifiable text for a matched token, or an empty string.
+        *
+        * Modifiable text is text content that may be read and changed without
+        * changing the HTML structure of the document around it. This includes
+        * the contents of `#text` nodes in the HTML as well as the inner
+        * contents of HTML comments, Processing Instructions, and others, even
+        * though these nodes aren't part of a parsed DOM tree. It also includes
+        * the contents of SCRIPT and STYLE tags, of TEXTAREA tags, and of any
+        * other section in an HTML document which cannot contain HTML markup (DATA).
+        *
+        * If a token has no modifiable text then an empty string is returned to
+        * avoid needless crashing or type errors. An empty string does not mean
+        * that a token has modifiable text, and a token with modifiable text may
+        * have an empty string (e.g. a comment with no contents).
+        *
+        * @since 6.5.0
+        *
+        * @return string
+        */
+       public function get_modifiable_text() {
+               if ( null === $this->text_starts_at ) {
+                       return '';
+               }
+
+               $text = substr( $this->html, $this->text_starts_at, $this->text_length );
+
+               // Comment data is not decoded.
+               if (
+                       self::STATE_CDATA_NODE === $this->parser_state ||
+                       self::STATE_COMMENT === $this->parser_state ||
+                       self::STATE_DOCTYPE === $this->parser_state ||
+                       self::STATE_FUNKY_COMMENT === $this->parser_state
+               ) {
+                       return $text;
+               }
+
+               $tag_name = $this->get_tag();
+               if (
+                       // Script data is not decoded.
+                       'SCRIPT' === $tag_name ||
+
+                       // RAWTEXT data is not decoded.
+                       'IFRAME' === $tag_name ||
+                       'NOEMBED' === $tag_name ||
+                       'NOFRAMES' === $tag_name ||
+                       'STYLE' === $tag_name ||
+                       'XMP' === $tag_name
+               ) {
+                       return $text;
+               }
+
+               $decoded = html_entity_decode( $text, ENT_QUOTES | ENT_HTML5 | ENT_SUBSTITUTE );
+
+               if ( '' === $decoded ) {
+                       return '';
+               }
+
+               /*
+                * TEXTAREA skips a leading newline, but this newline may appear not only as the
+                * literal character `\n`, but also as a character reference, such as in the
+                * following markup: `<textarea>&#x0a;Content</textarea>`.
+                *
+                * For these cases it's important to first decode the text content before checking
+                * for a leading newline and removing it.
+                */
+               if (
+                       self::STATE_MATCHED_TAG === $this->parser_state &&
+                       'TEXTAREA' === $tag_name &&
+                       strlen( $decoded ) > 0 &&
+                       "\n" === $decoded[0]
+               ) {
+                       return substr( $decoded, 1 );
+               }
+
+               return $decoded;
+       }
+
+       /**
</ins><span class="cx" style="display: block; padding: 0 10px">          * Updates or creates a new attribute on the currently matched tag with the passed value.
</span><span class="cx" style="display: block; padding: 0 10px">         *
</span><span class="cx" style="display: block; padding: 0 10px">         * For boolean attributes special handling is provided:
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -2746,7 +3270,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">        }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">        /**
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-         * Parser Ready State
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+  * Parser Ready State.
</ins><span class="cx" style="display: block; padding: 0 10px">          *
</span><span class="cx" style="display: block; padding: 0 10px">         * Indicates that the parser is ready to run and waiting for a state transition.
</span><span class="cx" style="display: block; padding: 0 10px">         * It may not have started yet, or it may have just finished parsing a token and
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -2759,7 +3283,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">        const STATE_READY = 'STATE_READY';
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">        /**
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-         * Parser Complete State
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+  * Parser Complete State.
</ins><span class="cx" style="display: block; padding: 0 10px">          *
</span><span class="cx" style="display: block; padding: 0 10px">         * Indicates that the parser has reached the end of the document and there is
</span><span class="cx" style="display: block; padding: 0 10px">         * nothing left to scan. It finished parsing the last token completely.
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -2771,7 +3295,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">        const STATE_COMPLETE = 'STATE_COMPLETE';
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">        /**
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-         * Parser Incomplete State
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+  * Parser Incomplete Input State.
</ins><span class="cx" style="display: block; padding: 0 10px">          *
</span><span class="cx" style="display: block; padding: 0 10px">         * Indicates that the parser has reached the end of the document before finishing
</span><span class="cx" style="display: block; padding: 0 10px">         * a token. It started parsing a token but there is a possibility that the input
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -2784,10 +3308,10 @@
</span><span class="cx" style="display: block; padding: 0 10px">         *
</span><span class="cx" style="display: block; padding: 0 10px">         * @access private
</span><span class="cx" style="display: block; padding: 0 10px">         */
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-        const STATE_INCOMPLETE = 'STATE_INCOMPLETE';
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+ const STATE_INCOMPLETE_INPUT = 'STATE_INCOMPLETE_INPUT';
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">        /**
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-         * Parser Matched Tag State
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+  * Parser Matched Tag State.
</ins><span class="cx" style="display: block; padding: 0 10px">          *
</span><span class="cx" style="display: block; padding: 0 10px">         * Indicates that the parser has found an HTML tag and it's possible to get
</span><span class="cx" style="display: block; padding: 0 10px">         * the tag name and read or modify its attributes (if it's not a closing tag).
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -2797,4 +3321,153 @@
</span><span class="cx" style="display: block; padding: 0 10px">         * @access private
</span><span class="cx" style="display: block; padding: 0 10px">         */
</span><span class="cx" style="display: block; padding: 0 10px">        const STATE_MATCHED_TAG = 'STATE_MATCHED_TAG';
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+
+       /**
+        * Parser Text Node State.
+        *
+        * Indicates that the parser has found a text node and it's possible
+        * to read and modify that text.
+        *
+        * @since 6.5.0
+        *
+        * @access private
+        */
+       const STATE_TEXT_NODE = 'STATE_TEXT_NODE';
+
+       /**
+        * Parser CDATA Node State.
+        *
+        * Indicates that the parser has found a CDATA node and it's possible
+        * to read and modify its modifiable text. Note that in HTML there are
+        * no CDATA nodes outside of foreign content (SVG and MathML). Outside
+        * of foreign content, they are treated as HTML comments.
+        *
+        * @since 6.5.0
+        *
+        * @access private
+        */
+       const STATE_CDATA_NODE = 'STATE_CDATA_NODE';
+
+       /**
+        * Indicates that the parser has found an HTML comment and it's
+        * possible to read and modify its modifiable text.
+        *
+        * @since 6.5.0
+        *
+        * @access private
+        */
+       const STATE_COMMENT = 'STATE_COMMENT';
+
+       /**
+        * Indicates that the parser has found a DOCTYPE node and it's
+        * possible to read and modify its modifiable text.
+        *
+        * @since 6.5.0
+        *
+        * @access private
+        */
+       const STATE_DOCTYPE = 'STATE_DOCTYPE';
+
+       /**
+        * Indicates that the parser has found an empty tag closer `</>`.
+        *
+        * Note that in HTML there are no empty tag closers, and they
+        * are ignored. Nonetheless, the Tag Processor still
+        * recognizes them as they appear in the HTML stream.
+        *
+        * These were historically discussed as a "presumptuous tag
+        * closer," which would close the nearest open tag, but were
+        * dismissed in favor of explicitly-closing tags.
+        *
+        * @since 6.5.0
+        *
+        * @access private
+        */
+       const STATE_PRESUMPTUOUS_TAG = 'STATE_PRESUMPTUOUS_TAG';
+
+       /**
+        * Indicates that the parser has found a "funky comment"
+        * and it's possible to read and modify its modifiable text.
+        *
+        * Example:
+        *
+        *     </%url>
+        *     </{"wp-bit":"query/post-author"}>
+        *     </2>
+        *
+        * Funky comments are tag closers with invalid tag names. Note
+        * that in HTML these are turned into bogus comments. Nonetheless,
+        * the Tag Processor recognizes them in a stream of HTML and
+        * exposes them for inspection and modification.
+        *
+        * @since 6.5.0
+        *
+        * @access private
+        */
+       const STATE_FUNKY_COMMENT = 'STATE_WP_FUNKY';
+
+       /**
+        * Indicates that a comment was created when encountering an abruptly-closed HTML comment.
+        *
+        * Example:
+        *
+        *     <!-->
+        *     <!--->
+        *
+        * @since 6.5.0
+        */
+       const COMMENT_AS_ABRUPTLY_CLOSED_COMMENT = 'COMMENT_AS_ABRUPTLY_CLOSED_COMMENT';
+
+       /**
+        * Indicates that a comment would be parsed as a CDATA node,
+        * were HTML to allow CDATA nodes outside of foreign content.
+        *
+        * Example:
+        *
+        *     <![CDATA[This is a CDATA node.]]>
+        *
+        * This is an HTML comment, but it looks like a CDATA node.
+        *
+        * @since 6.5.0
+        */
+       const COMMENT_AS_CDATA_LOOKALIKE = 'COMMENT_AS_CDATA_LOOKALIKE';
+
+       /**
+        * Indicates that a comment was created when encountering
+        * normative HTML comment syntax.
+        *
+        * Example:
+        *
+        *     <!-- this is a comment -->
+        *
+        * @since 6.5.0
+        */
+       const COMMENT_AS_HTML_COMMENT = 'COMMENT_AS_HTML_COMMENT';
+
+       /**
+        * Indicates that a comment would be parsed as a Processing
+        * Instruction node, were they to exist within HTML.
+        *
+        * Example:
+        *
+        *     <?wp __( 'Like' ) ?>
+        *
+        * This is an HTML comment, but it looks like a Processing Instruction node.
+        *
+        * @since 6.5.0
+        */
+       const COMMENT_AS_PI_NODE_LOOKALIKE = 'COMMENT_AS_PI_NODE_LOOKALIKE';
+
+       /**
+        * Indicates that a comment was created when encountering invalid
+        * HTML input, a so-called "bogus comment."
+        *
+        * Example:
+        *
+        *     <?nothing special>
+        *     <!{nothing special}>
+        *
+        * @since 6.5.0
+        */
+       const COMMENT_AS_INVALID_HTML = 'COMMENT_AS_INVALID_HTML';
</ins><span class="cx" style="display: block; padding: 0 10px"> }
</span></span></pre></div>
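<p>The TEXTAREA handling in <code>get_modifiable_text()</code> decodes character references before checking for a leading newline. The following is a minimal standalone sketch of that ordering, not the committed implementation: the function name <code>sketch_textarea_text</code> is hypothetical and exists only for illustration, and the sketch assumes that PHP's <code>html_entity_decode()</code> decodes <code>&amp;#x0a;</code> to a newline under <code>ENT_HTML5</code>.</p>

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): mirrors the
 * decode-then-strip order described for TEXTAREA contents.
 *
 * @param string $raw Raw text found between the TEXTAREA tags.
 * @return string Decoded text with one leading newline removed, if present.
 */
function sketch_textarea_text( string $raw ): string {
	// Decode character references first so that a reference such as
	// `&#x0a;` is seen as a newline in the next step.
	$decoded = html_entity_decode( $raw, ENT_QUOTES | ENT_HTML5 | ENT_SUBSTITUTE );

	// Only after decoding, drop a single leading newline.
	if ( '' !== $decoded && "\n" === $decoded[0] ) {
		return substr( $decoded, 1 );
	}

	return $decoded;
}

// Literal newline, character-reference newline, and a doubled newline.
var_dump( sketch_textarea_text( "\nContent" ) );
var_dump( sketch_textarea_text( '&#x0a;Content' ) );
var_dump( sketch_textarea_text( "\n\nContent" ) );
```

<p>Checking for the newline before decoding would miss the character-reference form. Only the first newline is removed, so a second one is kept as content.</p>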
<a id="trunktestsphpunittestshtmlapiwpHtmlProcessorBreadcrumbsphp"></a>
<div class="modfile"><h4 style="background-color: #eee; color: inherit; margin: 1em 0; padding: 1.3em; font-size: 115%">Modified: trunk/tests/phpunit/tests/html-api/wpHtmlProcessorBreadcrumbs.php</h4>
<pre class="diff"><span>
<span class="info" style="display: block; padding: 0 10px; color: #888">--- trunk/tests/phpunit/tests/html-api/wpHtmlProcessorBreadcrumbs.php 2024-01-24 19:25:00 UTC (rev 57347)
+++ trunk/tests/phpunit/tests/html-api/wpHtmlProcessorBreadcrumbs.php   2024-01-24 23:35:46 UTC (rev 57348)
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -514,7 +514,11 @@
</span><span class="cx" style="display: block; padding: 0 10px">         * @covers WP_HTML_Processor::seek
</span><span class="cx" style="display: block; padding: 0 10px">         */
</span><span class="cx" style="display: block; padding: 0 10px">        public function test_can_seek_back_and_forth() {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                $p = WP_HTML_Processor::create_fragment( '<div><p one><div><p><div two><p><div><p><div><p three>' );
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         $p = WP_HTML_Processor::create_fragment(
+                       <<<'HTML'
+<div>text<p one>more stuff<div><![CDATA[this is not real CDATA]]><p><!-- hi --><div two><p><div><p>three comes soon<div><p three>
+HTML
+               );
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                // Find first tag of interest.
</span><span class="cx" style="display: block; padding: 0 10px">                while ( $p->next_tag() && null === $p->get_attribute( 'one' ) ) {
</span></span></pre></div>
<a id="trunktestsphpunittestshtmlapiwpHtmlTagProcessortokenscanningphp"></a>
<div class="addfile"><h4 style="background-color: #eee; color: inherit; margin: 1em 0; padding: 1.3em; font-size: 115%">Added: trunk/tests/phpunit/tests/html-api/wpHtmlTagProcessor-token-scanning.php</h4>
<pre class="diff"><span>
<span class="info" style="display: block; padding: 0 10px; color: #888">--- trunk/tests/phpunit/tests/html-api/wpHtmlTagProcessor-token-scanning.php                          (rev 0)
+++ trunk/tests/phpunit/tests/html-api/wpHtmlTagProcessor-token-scanning.php    2024-01-24 23:35:46 UTC (rev 57348)
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -0,0 +1,734 @@
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+<?php
+/**
+ * Unit tests covering WP_HTML_Tag_Processor token-scanning functionality.
+ *
+ * @package WordPress
+ * @subpackage HTML-API
+ *
+ * @since 6.5.0
+ *
+ * @group html-api
+ *
+ * @coversDefaultClass WP_HTML_Tag_Processor
+ */
+class Tests_HtmlApi_WpHtmlProcessor_Token_Scanning extends WP_UnitTestCase {
+       /**
+        * Ensures that scanning finishes in a complete form when the document is empty.
+        *
+        * @ticket 60170
+        *
+        * @since 6.5.0
+        *
+        * @covers WP_HTML_Tag_Processor::next_token
+        */
+       public function test_completes_empty_document() {
+               $processor = new WP_HTML_Tag_Processor( '' );
+
+               $this->assertFalse(
+                       $processor->next_token(),
+                       "Should not have found any tokens but found {$processor->get_token_type()}."
+               );
+       }
+
+       /**
+        * Ensures that normative text nodes are properly parsed.
+        *
+        * @ticket 60170
+        *
+        * @since 6.5.0
+        *
+        * @covers WP_HTML_Tag_Processor::next_token
+        */
+       public function test_basic_assertion_text_node() {
+               $processor = new WP_HTML_Tag_Processor( 'Hello, World!' );
+               $processor->next_token();
+
+               $this->assertSame(
+                       '#text',
+                       $processor->get_token_type(),
+                       "Should have found #text token type but found {$processor->get_token_type()} instead."
+               );
+
+               $this->assertSame(
+                       'Hello, World!',
+                       $processor->get_modifiable_text(),
+                       'Found incorrect modifiable text.'
+               );
+       }
+
+       /**
+        * Ensures that normative Elements are properly parsed.
+        *
+        * @ticket 60170
+        *
+        * @since 6.5.0
+        *
+        * @covers WP_HTML_Tag_Processor::next_token
+        */
+       public function test_basic_assertion_element() {
+               $processor = new WP_HTML_Tag_Processor( '<div id="test" inert>Hello, World!</div>' );
+               $processor->next_token();
+
+               $this->assertSame(
+                       'DIV',
+                       $processor->get_token_name(),
+                       "Should have found DIV tag name but found {$processor->get_token_name()} instead."
+               );
+
+               $this->assertSame(
+                       'test',
+                       $processor->get_attribute( 'id' ),
+                       "Should have found id attribute value 'test' but found {$processor->get_attribute( 'id' )} instead."
+               );
+
+               $this->assertTrue(
+                       $processor->get_attribute( 'inert' ),
+                       "Should have found boolean attribute 'inert' but didn't."
+               );
+
+               $attributes     = $processor->get_attribute_names_with_prefix( '' );
+               $attribute_list = array_map( 'Tests_HtmlApi_WpHtmlProcessor_Token_Scanning::quoted', $attributes );
+               $this->assertSame(
+                       array( 'id', 'inert' ),
+                       $attributes,
+                       'Should have found only two attributes but found ' . implode( ', ', $attribute_list ) . ' instead.'
+               );
+
+               $this->assertSame(
+                       '',
+                       $processor->get_modifiable_text(),
+                       "Should have found empty modifiable text but found '{$processor->get_modifiable_text()}' instead."
+               );
+       }
+
+       /**
+        * Ensures that normative SCRIPT elements are properly parsed.
+        *
+        * @ticket 60170
+        *
+        * @since 6.5.0
+        *
+        * @covers WP_HTML_Tag_Processor::next_token
+        */
+       public function test_basic_assertion_script_element() {
+               $processor = new WP_HTML_Tag_Processor( '<script type="module">console.log( "Hello, World!" );</script>' );
+               $processor->next_token();
+
+               $this->assertSame(
+                       'SCRIPT',
+                       $processor->get_token_name(),
+                       "Should have found SCRIPT tag name but found {$processor->get_token_name()} instead."
+               );
+
+               $this->assertSame(
+                       'module',
+                       $processor->get_attribute( 'type' ),
+                       "Should have found type attribute value 'module' but found {$processor->get_attribute( 'type' )} instead."
+               );
+
+               $attributes     = $processor->get_attribute_names_with_prefix( '' );
+               $attribute_list = array_map( 'Tests_HtmlApi_WpHtmlProcessor_Token_Scanning::quoted', $attributes );
+               $this->assertSame(
+                       array( 'type' ),
+                       $attributes,
+                       "Should have found single 'type' attribute but found " . implode( ', ', $attribute_list ) . ' instead.'
+               );
+
+               $this->assertSame(
+                       'console.log( "Hello, World!" );',
+                       $processor->get_modifiable_text(),
+                       'Found incorrect modifiable text.'
+               );
+       }
+
+       /**
+        * Ensures that normative TEXTAREA elements are properly parsed.
+        *
+        * @ticket 60170
+        *
+        * @since 6.5.0
+        *
+        * @covers WP_HTML_Tag_Processor::next_token
+        */
+       public function test_basic_assertion_textarea_element() {
+               $processor = new WP_HTML_Tag_Processor(
+                       <<<HTML
+<textarea rows=30 cols="80">
+Is <HTML> &gt; XHTML?
+</textarea>
+HTML
+               );
+               $processor->next_token();
+
+               $this->assertSame(
+                       'TEXTAREA',
+                       $processor->get_token_name(),
+                       "Should have found TEXTAREA tag name but found {$processor->get_token_name()} instead."
+               );
+
+               $this->assertSame(
+                       '30',
+                       $processor->get_attribute( 'rows' ),
+                       "Should have found rows attribute value '30' but found {$processor->get_attribute( 'rows' )} instead."
+               );
+
+               $this->assertSame(
+                       '80',
+                       $processor->get_attribute( 'cols' ),
+                       "Should have found cols attribute value '80' but found {$processor->get_attribute( 'cols' )} instead."
+               );
+
+               $attributes     = $processor->get_attribute_names_with_prefix( '' );
+               $attribute_list = array_map( 'Tests_HtmlApi_WpHtmlProcessor_Token_Scanning::quoted', $attributes );
+               $this->assertSame(
+                       array( 'rows', 'cols' ),
+                       $attributes,
+                       'Should have found only two attributes but found ' . implode( ', ', $attribute_list ) . ' instead.'
+               );
+
+               // Note that the leading newline should be removed from the TEXTAREA contents.
+               $this->assertSame(
+                       "Is <HTML> > XHTML?\n",
+                       $processor->get_modifiable_text(),
+                       'Found incorrect modifiable text.'
+               );
+       }
+
+       /**
+        * Ensures that normative TITLE elements are properly parsed.
+        *
+        * @ticket 60170
+        *
+        * @since 6.5.0
+        *
+        * @covers WP_HTML_Tag_Processor::next_token
+        */
+       public function test_basic_assertion_title_element() {
+               $processor = new WP_HTML_Tag_Processor(
+                       <<<HTML
+<title class="multi-line-title">
+Is <HTML> &gt; XHTML?
+</title>
+HTML
+               );
+               $processor->next_token();
+
+               $this->assertSame(
+                       'TITLE',
+                       $processor->get_token_name(),
+                       "Should have found TITLE tag name but found {$processor->get_token_name()} instead."
+               );
+
+               $this->assertSame(
+                       'multi-line-title',
+                       $processor->get_attribute( 'class' ),
+                       "Should have found class attribute value 'multi-line-title' but found {$processor->get_attribute( 'class' )} instead."
+               );
+
+               $attributes     = $processor->get_attribute_names_with_prefix( '' );
+               $attribute_list = array_map( 'Tests_HtmlApi_WpHtmlProcessor_Token_Scanning::quoted', $attributes );
+               $this->assertSame(
+                       array( 'class' ),
+                       $attributes,
+                       'Should have found only one attribute but found ' . implode( ', ', $attribute_list ) . ' instead.'
+               );
+
+               $this->assertSame(
+                       "\nIs <HTML> > XHTML?\n",
+                       $processor->get_modifiable_text(),
+                       'Found incorrect modifiable text.'
+               );
+       }
+
+       /**
+        * Ensures that normative RAWTEXT elements are properly parsed.
+        *
+        * @ticket 60170
+        *
+        * @since 6.5.0
+        *
+        * @covers WP_HTML_Tag_Processor::next_token
+        *
+        * @dataProvider data_rawtext_elements
+        *
+        * @param string $tag_name The name of the RAWTEXT tag to test.
+        */
+       public function test_basic_assertion_rawtext_elements( $tag_name ) {
+               $processor = new WP_HTML_Tag_Processor(
+                       <<<HTML
+<{$tag_name} class="multi-line-title">
+Is <HTML> &gt; XHTML?
+</{$tag_name}>
+HTML
+               );
+               $processor->next_token();
+
+               $this->assertSame(
+                       $tag_name,
+                       $processor->get_token_name(),
+                       "Should have found {$tag_name} tag name but found {$processor->get_token_name()} instead."
+               );
+
+               $this->assertSame(
+                       'multi-line-title',
+                       $processor->get_attribute( 'class' ),
+                       "Should have found class attribute value 'multi-line-title' but found {$processor->get_attribute( 'class' )} instead."
+               );
+
+               $attributes     = $processor->get_attribute_names_with_prefix( '' );
+               $attribute_list = array_map( 'Tests_HtmlApi_WpHtmlProcessor_Token_Scanning::quoted', $attributes );
+               $this->assertSame(
+                       array( 'class' ),
+                       $attributes,
+                       'Should have found only one attribute but found ' . implode( ', ', $attribute_list ) . ' instead.'
+               );
+
+               $this->assertSame(
+                       "\nIs <HTML> &gt; XHTML?\n",
+                       $processor->get_modifiable_text(),
+                       'Found incorrect modifiable text.'
+               );
+       }
+
+       /**
+        * Data provider.
+        *
+        * @return array[]
+        */
+       public function data_rawtext_elements() {
+               return array(
+                       'IFRAME'   => array( 'IFRAME' ),
+                       'NOEMBED'  => array( 'NOEMBED' ),
+                       'NOFRAMES' => array( 'NOFRAMES' ),
+                       'STYLE'    => array( 'STYLE' ),
+                       'XMP'      => array( 'XMP' ),
+               );
+       }
+
+       /**
+        * Ensures that normative CDATA sections are properly parsed.
+        *
+        * @ticket 60170
+        *
+        * @since 6.5.0
+        *
+        * @covers WP_HTML_Tag_Processor::next_token
+        */
+       public function test_basic_assertion_cdata_section() {
+               $processor = new WP_HTML_Tag_Processor( '<![CDATA[this is a comment]]>' );
+               $processor->next_token();
+
+               $this->assertSame(
+                       '#comment',
+                       $processor->get_token_name(),
+                       "Should have found comment token but found {$processor->get_token_name()} instead."
+               );
+
+               $this->assertSame(
+                       WP_HTML_Processor::COMMENT_AS_CDATA_LOOKALIKE,
+                       $processor->get_comment_type(),
+                       'Should have detected a CDATA-like invalid comment.'
+               );
+
+               $this->assertNull(
+                       $processor->get_tag(),
+                       'Should not have been able to query tag name on non-element token.'
+               );
+
+               $this->assertNull(
+                       $processor->get_attribute( 'type' ),
+                       'Should not have been able to query attributes on non-element token.'
+               );
+
+               $this->assertSame(
+                       'this is a comment',
+                       $processor->get_modifiable_text(),
+                       'Found incorrect modifiable text.'
+               );
+       }
+
+       /**
+        * Ensures that abruptly-closed CDATA sections are properly parsed as comments.
+        *
+        * @ticket 60170
+        *
+        * @since 6.5.0
+        *
+        * @covers WP_HTML_Tag_Processor::next_token
+        */
+       public function test_basic_assertion_abruptly_closed_cdata_section() {
+               $processor = new WP_HTML_Tag_Processor( '<![CDATA[this is > a comment]]>' );
+               $processor->next_token();
+
+               $this->assertSame(
+                       '#comment',
+                       $processor->get_token_name(),
+                       "Should have found a bogus comment but found {$processor->get_token_name()} instead."
+               );
+
+               $this->assertNull(
+                       $processor->get_tag(),
+                       'Should not have been able to query tag name on non-element token.'
+               );
+
+               $this->assertNull(
+                       $processor->get_attribute( 'type' ),
+                       'Should not have been able to query attributes on non-element token.'
+               );
+
+               $this->assertSame(
+                       '[CDATA[this is ',
+                       $processor->get_modifiable_text(),
+                       'Found incorrect modifiable text.'
+               );
+
+               $processor->next_token();
+
+               $this->assertSame(
+                       '#text',
+                       $processor->get_token_name(),
+                       "Should have found text node but found {$processor->get_token_name()} instead."
+               );
+
+               $this->assertSame(
+                       ' a comment]]>',
+                       $processor->get_modifiable_text(),
+                       'Should have found remaining syntax from abruptly-closed CDATA section.'
+               );
+       }
+
+       /**
+        * Ensures that normative Processing Instruction nodes are properly parsed.
+        *
+        * @ticket 60170
+        *
+        * @since 6.5.0
+        *
+        * @covers WP_HTML_Tag_Processor::next_token
+        */
+       public function test_basic_assertion_processing_instruction() {
+               $processor = new WP_HTML_Tag_Processor( '<?wp-bit {"just": "kidding"}?>' );
+               $processor->next_token();
+
+               $this->assertSame(
+                       '#comment',
+                       $processor->get_token_name(),
+                       "Should have found comment token but found {$processor->get_token_name()} instead."
+               );
+
+               $this->assertSame(
+                       WP_HTML_Processor::COMMENT_AS_PI_NODE_LOOKALIKE,
+                       $processor->get_comment_type(),
+                       'Should have detected a Processing Instruction-like invalid comment.'
+               );
+
+               $this->assertSame(
+                       'wp-bit',
+                       $processor->get_tag(),
+                       "Should have found PI target as tag name but found {$processor->get_tag()} instead."
+               );
+
+               $this->assertNull(
+                       $processor->get_attribute( 'type' ),
+                       'Should not have been able to query attributes on non-element token.'
+               );
+
+               $this->assertSame(
+                       ' {"just": "kidding"}',
+                       $processor->get_modifiable_text(),
+                       'Found incorrect modifiable text.'
+               );
+       }
+
+       /**
+        * Ensures that abruptly-closed Processing Instruction nodes are properly parsed as comments.
+        *
+        * @ticket 60170
+        *
+        * @since 6.5.0
+        *
+        * @covers WP_HTML_Tag_Processor::next_token
+        */
+       public function test_basic_assertion_abruptly_closed_processing_instruction() {
+               $processor = new WP_HTML_Tag_Processor( '<?version=">=5.3.6"?>' );
+               $processor->next_token();
+
+               $this->assertSame(
+                       '#comment',
+                       $processor->get_token_type(),
+                       "Should have found bogus comment but found {$processor->get_token_type()} instead."
+               );
+
+               $this->assertSame(
+                       '#comment',
+                       $processor->get_token_name(),
+                       "Should have found #comment as name but found {$processor->get_token_name()} instead."
+               );
+
+               $this->assertNull(
+                       $processor->get_tag(),
+                       'Should not have been able to query tag name on non-element token.'
+               );
+
+               $this->assertNull(
+                       $processor->get_attribute( 'type' ),
+                       'Should not have been able to query attributes on non-element token.'
+               );
+
+               $this->assertSame(
+                       'version="',
+                       $processor->get_modifiable_text(),
+                       'Found incorrect modifiable text.'
+               );
+
+               $processor->next_token();
+
+               $this->assertSame(
+                       '=5.3.6"?>',
+                       $processor->get_modifiable_text(),
+                       'Should have found remaining syntax from abruptly-closed Processing Instruction.'
+               );
+       }
+
+       /**
+        * Ensures that common comments are properly parsed.
+        *
+        * @ticket 60170
+        *
+        * @since 6.5.0
+        *
+        * @dataProvider data_common_comments
+        *
+        * @covers WP_HTML_Tag_Processor::next_token
+        *
+        * @param string $html Contains the comment in full.
+        * @param string $text Contains the appropriate modifiable text.
+        */
+       public function test_basic_assertion_common_comments( $html, $text ) {
+               $processor = new WP_HTML_Tag_Processor( $html );
+               $processor->next_token();
+
+               $this->assertSame(
+                       '#comment',
+                       $processor->get_token_type(),
+                       "Should have found comment but found {$processor->get_token_type()} instead."
+               );
+
+               $this->assertSame(
+                       '#comment',
+                       $processor->get_token_name(),
+                       "Should have found #comment as name but found {$processor->get_token_name()} instead."
+               );
+
+               $this->assertNull(
+                       $processor->get_tag(),
+                       'Should not have been able to query tag name on non-element token.'
+               );
+
+               $this->assertNull(
+                       $processor->get_attribute( 'type' ),
+                       'Should not have been able to query attributes on non-element token.'
+               );
+
+               $this->assertSame(
+                       $text,
+                       $processor->get_modifiable_text(),
+                       'Found incorrect modifiable text.'
+               );
+       }
+
+       /**
+        * Data provider.
+        *
+        * @return array[]
+        */
+       public function data_common_comments() {
+               return array(
+                       'Shortest comment'        => array( '<!-->', '' ),
+                       'Short comment'           => array( '<!--->', '' ),
+                       'Short comment w/o text'  => array( '<!---->', '' ),
+                       'Short comment with text' => array( '<!----->', '-' ),
+                       'PI node without target'  => array( '<? missing?>', ' missing?' ),
+                       'Invalid PI node'         => array( '<?/missing/>', '/missing/' ),
+                       'Invalid ! directive'     => array( '<!something else>', 'something else' ),
+               );
+       }
+
+       /**
+        * Ensures that normative HTML comments are properly parsed.
+        *
+        * @ticket 60170
+        *
+        * @since 6.5.0
+        *
+        * @covers WP_HTML_Tag_Processor::next_token
+        */
+       public function test_basic_assertion_html_comment() {
+               $processor = new WP_HTML_Tag_Processor( '<!-- wp:paragraph -->' );
+               $processor->next_token();
+
+               $this->assertSame(
+                       '#comment',
+                       $processor->get_token_type(),
+                       "Should have found comment but found {$processor->get_token_type()} instead."
+               );
+
+               $this->assertSame(
+                       '#comment',
+                       $processor->get_token_name(),
+                       "Should have found #comment as name but found {$processor->get_token_name()} instead."
+               );
+
+               $this->assertNull(
+                       $processor->get_tag(),
+                       'Should not have been able to query tag name on non-element token.'
+               );
+
+               $this->assertNull(
+                       $processor->get_attribute( 'type' ),
+                       'Should not have been able to query attributes on non-element token.'
+               );
+
+               $this->assertSame(
+                       ' wp:paragraph ',
+                       $processor->get_modifiable_text(),
+                       'Found incorrect modifiable text.'
+               );
+       }
+
+       /**
+        * Ensures that normative DOCTYPE elements are properly parsed.
+        *
+        * @ticket 60170
+        *
+        * @since 6.5.0
+        *
+        * @covers WP_HTML_Tag_Processor::next_token
+        */
+       public function test_basic_assertion_doctype() {
+               $processor = new WP_HTML_Tag_Processor( '<!DOCTYPE html>' );
+               $processor->next_token();
+
+               $this->assertSame(
+                       '#doctype',
+                       $processor->get_token_type(),
+                       "Should have found DOCTYPE but found {$processor->get_token_type()} instead."
+               );
+
+               $this->assertSame(
+                       'html',
+                       $processor->get_token_name(),
+                       "Should have found 'html' as name but found {$processor->get_token_name()} instead."
+               );
+
+               $this->assertNull(
+                       $processor->get_tag(),
+                       'Should not have been able to query tag name on non-element token.'
+               );
+
+               $this->assertNull(
+                       $processor->get_attribute( 'type' ),
+                       'Should not have been able to query attributes on non-element token.'
+               );
+
+               $this->assertSame(
+                       ' html',
+                       $processor->get_modifiable_text(),
+                       'Found incorrect modifiable text.'
+               );
+       }
+
+       /**
+        * Ensures that normative presumptuous tag closers (empty closers) are properly parsed.
+        *
+        * @ticket 60170
+        *
+        * @since 6.5.0
+        *
+        * @covers WP_HTML_Tag_Processor::next_token
+        */
+       public function test_basic_assertion_presumptuous_tag() {
+               $processor = new WP_HTML_Tag_Processor( '</>' );
+               $processor->next_token();
+
+               $this->assertSame(
+                       '#presumptuous-tag',
+                       $processor->get_token_type(),
+                       "Should have found presumptuous tag but found {$processor->get_token_type()} instead."
+               );
+
+               $this->assertSame(
+                       '#presumptuous-tag',
+                       $processor->get_token_name(),
+                       "Should have found #presumptuous-tag as name but found {$processor->get_token_name()} instead."
+               );
+
+               $this->assertNull(
+                       $processor->get_tag(),
+                       'Should not have been able to query tag name on non-element token.'
+               );
+
+               $this->assertNull(
+                       $processor->get_attribute( 'type' ),
+                       'Should not have been able to query attributes on non-element token.'
+               );
+
+               $this->assertSame(
+                       '',
+                       $processor->get_modifiable_text(),
+                       'Found incorrect modifiable text.'
+               );
+       }
+
+       /**
+        * Ensures that normative funky comments are properly parsed.
+        *
+        * @ticket 60170
+        *
+        * @since 6.5.0
+        *
+        * @covers WP_HTML_Tag_Processor::next_token
+        */
+       public function test_basic_assertion_funky_comment() {
+               $processor = new WP_HTML_Tag_Processor( '</%url>' );
+               $processor->next_token();
+
+               $this->assertSame(
+                       '#funky-comment',
+                       $processor->get_token_type(),
+                       "Should have found funky comment but found {$processor->get_token_type()} instead."
+               );
+
+               $this->assertSame(
+                       '#funky-comment',
+                       $processor->get_token_name(),
+                       "Should have found #funky-comment as name but found {$processor->get_token_name()} instead."
+               );
+
+               $this->assertNull(
+                       $processor->get_tag(),
+                       'Should not have been able to query tag name on non-element token.'
+               );
+
+               $this->assertNull(
+                       $processor->get_attribute( 'type' ),
+                       'Should not have been able to query attributes on non-element token.'
+               );
+
+               $this->assertSame(
+                       '%url',
+                       $processor->get_modifiable_text(),
+                       'Found incorrect modifiable text.'
+               );
+       }
+
+       /**
+        * Test helper that wraps a string in double quotes.
+        *
+        * @param string $s The string to wrap in double-quotes.
+        * @return string The string wrapped in double-quotes.
+        */
+       private static function quoted( $s ) {
+               return "\"$s\"";
+       }
+}
</ins><span class="cx" style="display: block; padding: 0 10px">Property changes on: trunk/tests/phpunit/tests/html-api/wpHtmlTagProcessor-token-scanning.php
</span><span class="cx" style="display: block; padding: 0 10px">___________________________________________________________________
</span></span></pre></div>
<a id="svneolstyle"></a>
<div class="addfile"><h4 style="background-color: #eee; color: inherit; margin: 1em 0; padding: 1.3em; font-size: 115%">Added: svn:eol-style</h4></div>
<ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+native
</ins><span class="cx" style="display: block; padding: 0 10px">\ No newline at end of property
</span><a id="trunktestsphpunittestshtmlapiwpHtmlTagProcessorphp"></a>
<div class="modfile"><h4 style="background-color: #eee; color: inherit; margin: 1em 0; padding: 1.3em; font-size: 115%">Modified: trunk/tests/phpunit/tests/html-api/wpHtmlTagProcessor.php</h4>
<pre class="diff"><span>
<span class="info" style="display: block; padding: 0 10px; color: #888">--- trunk/tests/phpunit/tests/html-api/wpHtmlTagProcessor.php 2024-01-24 19:25:00 UTC (rev 57347)
+++ trunk/tests/phpunit/tests/html-api/wpHtmlTagProcessor.php   2024-01-24 23:35:46 UTC (rev 57348)
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -557,8 +557,10 @@
</span><span class="cx" style="display: block; padding: 0 10px">                $p = new WP_HTML_Tag_Processor( '<script>abc</script>' );
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                $p->next_tag();
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                $this->assertTrue( $p->next_tag( array( 'tag_closers' => 'visit' ) ), 'Did not find the </script> tag closer' );
-               $this->assertTrue( $p->is_tag_closer(), 'Indicated a <script> tag opener is a tag closer' );
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         $this->assertFalse(
+                       $p->next_tag( array( 'tag_closers' => 'visit' ) ),
+                       'Should not have found closing SCRIPT tag when closing an opener.'
+               );
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                $p = new WP_HTML_Tag_Processor( 'abc</script>' );
</span><span class="cx" style="display: block; padding: 0 10px">                $this->assertTrue( $p->next_tag( array( 'tag_closers' => 'visit' ) ), 'Did not find the </script> tag closer when there was no tag opener' );
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -566,8 +568,10 @@
</span><span class="cx" style="display: block; padding: 0 10px">                $p = new WP_HTML_Tag_Processor( '<textarea>abc</textarea>' );
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                $p->next_tag();
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                $this->assertTrue( $p->next_tag( array( 'tag_closers' => 'visit' ) ), 'Did not find the </textarea> tag closer' );
-               $this->assertTrue( $p->is_tag_closer(), 'Indicated a <textarea> tag opener is a tag closer' );
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         $this->assertFalse(
+                       $p->next_tag( array( 'tag_closers' => 'visit' ) ),
+                       'Should not have found closing TEXTAREA when closing an opener.'
+               );
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                $p = new WP_HTML_Tag_Processor( 'abc</textarea>' );
</span><span class="cx" style="display: block; padding: 0 10px">                $this->assertTrue( $p->next_tag( array( 'tag_closers' => 'visit' ) ), 'Did not find the </textarea> tag closer when there was no tag opener' );
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -575,8 +579,10 @@
</span><span class="cx" style="display: block; padding: 0 10px">                $p = new WP_HTML_Tag_Processor( '<title>abc</title>' );
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                $p->next_tag();
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                $this->assertTrue( $p->next_tag( array( 'tag_closers' => 'visit' ) ), 'Did not find the </title> tag closer' );
-               $this->assertTrue( $p->is_tag_closer(), 'Indicated a <title> tag opener is a tag closer' );
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         $this->assertFalse(
+                       $p->next_tag( array( 'tag_closers' => 'visit' ) ),
+                       'Should not have found closing TITLE when closing an opener.'
+               );
</ins><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                $p = new WP_HTML_Tag_Processor( 'abc</title>' );
</span><span class="cx" style="display: block; padding: 0 10px">                $this->assertTrue( $p->next_tag( array( 'tag_closers' => 'visit' ) ), 'Did not find the </title> tag closer when there was no tag opener' );
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -2357,6 +2363,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                        'No tags'                => array( 'this is nothing more than a text node' ),
</span><span class="cx" style="display: block; padding: 0 10px">                        'Text with comments'     => array( 'One <!-- sneaky --> comment.' ),
</span><span class="cx" style="display: block; padding: 0 10px">                        'Empty tag closer'       => array( '</>' ),
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                        'CDATA as HTML comment'  => array( '<![CDATA[this closes at the first &gt;]>' ),
</ins><span class="cx" style="display: block; padding: 0 10px">                         'Processing instruction' => array( '<?xml version="1.0"?>' ),
</span><span class="cx" style="display: block; padding: 0 10px">                        'Combination XML-like'   => array( '<!DOCTYPE xml><?xml version=""?><!-- this is not a real document. --><![CDATA[it only serves as a test]]>' ),
</span><span class="cx" style="display: block; padding: 0 10px">                );
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -2410,7 +2417,6 @@
</span><span class="cx" style="display: block; padding: 0 10px">                        'Incomplete CDATA'                     => array( '<![CDATA[something inside of here needs to get out' ),
</span><span class="cx" style="display: block; padding: 0 10px">                        'Partial CDATA'                        => array( '<![CDA' ),
</span><span class="cx" style="display: block; padding: 0 10px">                        'Partially closed CDATA]'              => array( '<![CDATA[cannot escape]' ),
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                        'Partially closed CDATA]>'             => array( '<![CDATA[cannot escape]>' ),
</del><span class="cx" style="display: block; padding: 0 10px">                         'Unclosed IFRAME'                      => array( '<iframe><div>' ),
</span><span class="cx" style="display: block; padding: 0 10px">                        'Unclosed NOEMBED'                     => array( '<noembed><div>' ),
</span><span class="cx" style="display: block; padding: 0 10px">                        'Unclosed NOFRAMES'                    => array( '<noframes><div>' ),
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -2507,7 +2513,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                        ),
</span><span class="cx" style="display: block; padding: 0 10px">                        'tag inside of CDATA'      => array(
</span><span class="cx" style="display: block; padding: 0 10px">                                'input'    => '<![CDATA[This <is> a <strong id="yes">HTML Tag</strong>]]><span>test</span>',
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                'expected' => '<![CDATA[This <is> a <strong id="yes">HTML Tag</strong>]]><span class="firstTag" foo="bar">test</span>',
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         'expected' => '<![CDATA[This <is> a <strong class="firstTag" foo="bar" id="yes">HTML Tag</strong>]]><span class="secondTag">test</span>',
</ins><span class="cx" style="display: block; padding: 0 10px">                         ),
</span><span class="cx" style="display: block; padding: 0 10px">                );
</span><span class="cx" style="display: block; padding: 0 10px">        }
</span></span></pre>
</div>
</div>

</body>
</html>