<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>[58970] trunk: HTML API: Allow subdividing text nodes by meaningful prefixes.</title>
</head>
<body>

<style type="text/css"><!--
#msg dl.meta { border: 1px #006 solid; background: #369; padding: 6px; color: #fff; }
#msg dl.meta dt { float: left; width: 6em; font-weight: bold; }
#msg dt:after { content:':';}
#msg dl, #msg dt, #msg ul, #msg li, #header, #footer, #logmsg { font-family: verdana,arial,helvetica,sans-serif; font-size: 10pt;  }
#msg dl a { font-weight: bold}
#msg dl a:link    { color:#fc3; }
#msg dl a:active  { color:#ff0; }
#msg dl a:visited { color:#cc6; }
h3 { font-family: verdana,arial,helvetica,sans-serif; font-size: 10pt; font-weight: bold; }
#msg pre { white-space: pre-line; overflow: auto; background: #ffc; border: 1px #fa0 solid; padding: 6px; }
#logmsg { background: #ffc; border: 1px #fa0 solid; padding: 1em 1em 0 1em; }
#logmsg p, #logmsg pre, #logmsg blockquote { margin: 0 0 1em 0; }
#logmsg p, #logmsg li, #logmsg dt, #logmsg dd { line-height: 14pt; }
#logmsg h1, #logmsg h2, #logmsg h3, #logmsg h4, #logmsg h5, #logmsg h6 { margin: .5em 0; }
#logmsg h1:first-child, #logmsg h2:first-child, #logmsg h3:first-child, #logmsg h4:first-child, #logmsg h5:first-child, #logmsg h6:first-child { margin-top: 0; }
#logmsg ul, #logmsg ol { padding: 0; list-style-position: inside; margin: 0 0 0 1em; }
#logmsg ul { text-indent: -1em; padding-left: 1em; }#logmsg ol { text-indent: -1.5em; padding-left: 1.5em; }
#logmsg > ul, #logmsg > ol { margin: 0 0 1em 0; }
#logmsg pre { background: #eee; padding: 1em; }
#logmsg blockquote { border: 1px solid #fa0; border-left-width: 10px; padding: 1em 1em 0 1em; background: white;}
#logmsg dl { margin: 0; }
#logmsg dt { font-weight: bold; }
#logmsg dd { margin: 0; padding: 0 0 0.5em 0; }
#logmsg dd:before { content:'\00bb';}
#logmsg table { border-spacing: 0px; border-collapse: collapse; border-top: 4px solid #fa0; border-bottom: 1px solid #fa0; background: #fff; }
#logmsg table th { text-align: left; font-weight: normal; padding: 0.2em 0.5em; border-top: 1px dotted #fa0; }
#logmsg table td { text-align: right; border-top: 1px dotted #fa0; padding: 0.2em 0.5em; }
#logmsg table thead th { text-align: center; border-bottom: 1px solid #fa0; }
#logmsg table th.Corner { text-align: left; }
#logmsg hr { border: none 0; border-top: 2px dashed #fa0; height: 1px; }
#header, #footer { color: #fff; background: #636; border: 1px #300 solid; padding: 6px; }
#patch { width: 100%; }
#patch h4 {font-family: verdana,arial,helvetica,sans-serif;font-size:10pt;padding:8px;background:#369;color:#fff;margin:0;}
#patch .propset h4, #patch .binary h4 {margin:0;}
#patch pre {padding:0;line-height:1.2em;margin:0;}
#patch .diff {width:100%;background:#eee;padding: 0 0 10px 0;overflow:auto;}
#patch .propset .diff, #patch .binary .diff  {padding:10px 0;}
#patch span {display:block;padding:0 10px;}
#patch .modfile, #patch .addfile, #patch .delfile, #patch .propset, #patch .binary, #patch .copfile {border:1px solid #ccc;margin:10px 0;}
#patch ins {background:#dfd;text-decoration:none;display:block;padding:0 10px;}
#patch del {background:#fdd;text-decoration:none;display:block;padding:0 10px;}
#patch .lines, .info {color:#888;background:#fff;}
--></style>
<div id="msg">
<dl class="meta" style="font-size: 105%">
<dt style="float: left; width: 6em; font-weight: bold">Revision</dt> <dd><a style="font-weight: bold" href="https://core.trac.wordpress.org/changeset/58970">58970</a><script type="application/ld+json">{"@context":"http://schema.org","@type":"EmailMessage","description":"Review this Commit","action":{"@type":"ViewAction","url":"https://core.trac.wordpress.org/changeset/58970","name":"Review Commit"}}</script></dd>
<dt style="float: left; width: 6em; font-weight: bold">Author</dt> <dd>dmsnell</dd>
<dt style="float: left; width: 6em; font-weight: bold">Date</dt> <dd>2024-09-02 23:19:08 +0000 (Mon, 02 Sep 2024)</dd>
</dl>

<pre style='padding-left: 1em; margin: 2em 0; border-left: 2px solid #ccc; line-height: 1.25; font-size: 105%; font-family: sans-serif'>HTML API: Allow subdividing text nodes by meaningful prefixes.

HTML parsing rules sometimes distinguish between character tokens that are all null bytes, all whitespace, or other content. This patch introduces a new function that classifies text node sub-regions so that these parsing rules can be applied more efficiently.

Further, when classified in this way, application code may skip some rules and decoding entirely, improving performance. For example, this can be used to ease the implementation of skipping inter-element whitespace, which is usually not rendered.

Developed in https://github.com/WordPress/wordpress-develop/pull/7236
Discussed in https://core.trac.wordpress.org/ticket/61974

Props dmsnell, jonsurrell.
Fixes <a href="https://core.trac.wordpress.org/ticket/61974">#61974</a>.</pre>
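<p>The classification described in the message above can be illustrated with a minimal standalone sketch. It is not the code added by this changeset: the function names and the <code>SKETCH_</code> constants are hypothetical and only mirror the constant names visible in the diff, and, as a simplifying assumption, character references are left undecoded.</p>

```php
<?php
// Hypothetical labels; they only mirror the constant names that appear in the diff.
const SKETCH_TEXT_IS_NULL_SEQUENCE = 'TEXT_IS_NULL_SEQUENCE';
const SKETCH_TEXT_IS_WHITESPACE    = 'TEXT_IS_WHITESPACE';
const SKETCH_TEXT_IS_GENERIC       = 'TEXT_IS_GENERIC';

/**
 * Classifies the leading region of raw text and reports its byte length.
 *
 * Assumption: character references are not decoded in this sketch.
 *
 * @param string $text Raw text node content.
 * @return array Classification label, then byte length of the classified region.
 */
function sketch_classify_text_prefix( string $text ): array {
	// A run of leading NULL bytes forms its own region.
	$nulls = strspn( $text, "\x00" );
	if ( $nulls > 0 ) {
		return array( SKETCH_TEXT_IS_NULL_SEQUENCE, $nulls );
	}

	// A run of leading HTML whitespace forms its own region.
	$whitespace = strspn( $text, " \t\n\f\r" );
	if ( $whitespace > 0 ) {
		return array( SKETCH_TEXT_IS_WHITESPACE, $whitespace );
	}

	// Anything else is generic, and the region runs to the end of the text.
	return array( SKETCH_TEXT_IS_GENERIC, strlen( $text ) );
}

/**
 * Splits text into consecutive classified regions using the prefix classifier.
 *
 * @param string $text Raw text node content.
 * @return array List of pairs: classification label, then the region's bytes.
 */
function sketch_subdivide( string $text ): array {
	$regions = array();
	$at      = 0;
	$end     = strlen( $text );

	while ( $at < $end ) {
		list( $kind, $length ) = sketch_classify_text_prefix( substr( $text, $at ) );
		$regions[]             = array( $kind, substr( $text, $at, $length ) );
		$at                   += $length;
	}

	return $regions;
}
```

<p>Under these assumptions, <code>sketch_subdivide( "\x00\x00  hi " )</code> yields a null-byte region, a whitespace region, and a generic region, so a caller can check a label instead of decoding and scanning the text again.</p>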

<h3>Modified Paths</h3>
<ul>
<li><a href="#trunksrcwpincludeshtmlapiclasswphtmlprocessorphp">trunk/src/wp-includes/html-api/class-wp-html-processor.php</a></li>
<li><a href="#trunksrcwpincludeshtmlapiclasswphtmltagprocessorphp">trunk/src/wp-includes/html-api/class-wp-html-tag-processor.php</a></li>
<li><a href="#trunktestsphpunittestshtmlapiwpHtmlProcessorHtml5libphp">trunk/tests/phpunit/tests/html-api/wpHtmlProcessorHtml5lib.php</a></li>
</ul>

</div>
<div id="patch">
<h3>Diff</h3>
<a id="trunksrcwpincludeshtmlapiclasswphtmlprocessorphp"></a>
<div class="modfile"><h4 style="background-color: #eee; color: inherit; margin: 1em 0; padding: 1.3em; font-size: 115%">Modified: trunk/src/wp-includes/html-api/class-wp-html-processor.php</h4>
<pre class="diff"><span>
<span class="info" style="display: block; padding: 0 10px; color: #888">--- trunk/src/wp-includes/html-api/class-wp-html-processor.php        2024-09-02 22:26:22 UTC (rev 58969)
+++ trunk/src/wp-includes/html-api/class-wp-html-processor.php  2024-09-02 23:19:08 UTC (rev 58970)
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -843,6 +843,12 @@
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                if ( self::PROCESS_NEXT_NODE === $node_to_process ) {
</span><span class="cx" style="display: block; padding: 0 10px">                        parent::next_token();
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                        if (
+                               WP_HTML_Tag_Processor::STATE_TEXT_NODE === $this->parser_state ||
+                               WP_HTML_Tag_Processor::STATE_CDATA_NODE === $this->parser_state
+                       ) {
+                               parent::subdivide_text_appropriately();
+                       }
</ins><span class="cx" style="display: block; padding: 0 10px">                 }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                // Finish stepping when there are no more tokens in the document.
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1056,8 +1062,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                         * Parse error: ignore the token.
</span><span class="cx" style="display: block; padding: 0 10px">                         */
</span><span class="cx" style="display: block; padding: 0 10px">                        case '#text':
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $text = $this->get_modifiable_text();
-                               if ( strlen( $text ) === strspn( $text, " \t\n\f\r" ) ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         if ( parent::TEXT_IS_WHITESPACE === $this->text_node_classification ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                                         return $this->step();
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="cx" style="display: block; padding: 0 10px">                                goto initial_anything_else;
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1145,8 +1150,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                         * Parse error: ignore the token.
</span><span class="cx" style="display: block; padding: 0 10px">                         */
</span><span class="cx" style="display: block; padding: 0 10px">                        case '#text':
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $text = $this->get_modifiable_text();
-                               if ( strlen( $text ) === strspn( $text, " \t\n\f\r" ) ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         if ( parent::TEXT_IS_WHITESPACE === $this->text_node_classification ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                                         return $this->step();
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="cx" style="display: block; padding: 0 10px">                                goto before_html_anything_else;
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1227,8 +1231,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                         * Parse error: ignore the token.
</span><span class="cx" style="display: block; padding: 0 10px">                         */
</span><span class="cx" style="display: block; padding: 0 10px">                        case '#text':
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $text = $this->get_modifiable_text();
-                               if ( strlen( $text ) === strspn( $text, " \t\n\f\r" ) ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         if ( parent::TEXT_IS_WHITESPACE === $this->text_node_classification ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                                         return $this->step();
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="cx" style="display: block; padding: 0 10px">                                goto before_head_anything_else;
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1323,16 +1326,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                                 * > U+000A LINE FEED (LF), U+000C FORM FEED (FF),
</span><span class="cx" style="display: block; padding: 0 10px">                                 * > U+000D CARRIAGE RETURN (CR), or U+0020 SPACE
</span><span class="cx" style="display: block; padding: 0 10px">                                 */
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $text = $this->get_modifiable_text();
-                               if ( '' === $text ) {
-                                       /*
-                                        * If the text is empty after processing HTML entities and stripping
-                                        * U+0000 NULL bytes then ignore the token.
-                                        */
-                                       return $this->step();
-                               }
-
-                               if ( strlen( $text ) === strspn( $text, " \t\n\f\r" ) ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         if ( parent::TEXT_IS_WHITESPACE === $this->text_node_classification ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                                         // Insert the character.
</span><span class="cx" style="display: block; padding: 0 10px">                                        $this->insert_html_element( $this->state->current_token );
</span><span class="cx" style="display: block; padding: 0 10px">                                        return true;
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1552,8 +1546,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                         * Parse error: ignore the token.
</span><span class="cx" style="display: block; padding: 0 10px">                         */
</span><span class="cx" style="display: block; padding: 0 10px">                        case '#text':
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $text = $this->get_modifiable_text();
-                               if ( strlen( $text ) === strspn( $text, " \t\n\f\r" ) ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         if ( parent::TEXT_IS_WHITESPACE === $this->text_node_classification ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                                         return $this->step_in_head();
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1654,8 +1647,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                         * > U+000D CARRIAGE RETURN (CR), or U+0020 SPACE
</span><span class="cx" style="display: block; padding: 0 10px">                         */
</span><span class="cx" style="display: block; padding: 0 10px">                        case '#text':
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $text = $this->get_modifiable_text();
-                               if ( strlen( $text ) === strspn( $text, " \t\n\f\r" ) ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         if ( parent::TEXT_IS_WHITESPACE === $this->text_node_classification ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                                         // Insert the character.
</span><span class="cx" style="display: block; padding: 0 10px">                                        $this->insert_html_element( $this->state->current_token );
</span><span class="cx" style="display: block; padding: 0 10px">                                        return true;
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1793,8 +1785,6 @@
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                switch ( $op ) {
</span><span class="cx" style="display: block; padding: 0 10px">                        case '#text':
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $current_token = $this->bookmarks[ $this->state->current_token->bookmark_name ];
-
</del><span class="cx" style="display: block; padding: 0 10px">                                 /*
</span><span class="cx" style="display: block; padding: 0 10px">                                 * > A character token that is U+0000 NULL
</span><span class="cx" style="display: block; padding: 0 10px">                                 *
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1804,11 +1794,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                                 * here, but if there are any other characters in the stream
</span><span class="cx" style="display: block; padding: 0 10px">                                 * the active formats should be reconstructed.
</span><span class="cx" style="display: block; padding: 0 10px">                                 */
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                if (
-                                       1 <= $current_token->length &&
-                                       "\x00" === $this->html[ $current_token->start ] &&
-                                       strspn( $this->html, "\x00", $current_token->start, $current_token->length ) === $current_token->length
-                               ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         if ( parent::TEXT_IS_NULL_SEQUENCE === $this->text_node_classification ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                                         // Parse error: ignore the token.
</span><span class="cx" style="display: block; padding: 0 10px">                                        return $this->step();
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -1820,8 +1806,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                                 * It is probably inter-element whitespace, but it may also
</span><span class="cx" style="display: block; padding: 0 10px">                                 * contain character references which decode only to whitespace.
</span><span class="cx" style="display: block; padding: 0 10px">                                 */
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $text = $this->get_modifiable_text();
-                               if ( strlen( $text ) !== strspn( $text, " \t\n\f\r" ) ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         if ( parent::TEXT_IS_GENERIC === $this->text_node_classification ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                                         $this->state->frameset_ok = false;
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -2829,12 +2814,11 @@
</span><span class="cx" style="display: block; padding: 0 10px">                                                'TR' === $current_node_name
</span><span class="cx" style="display: block; padding: 0 10px">                                        )
</span><span class="cx" style="display: block; padding: 0 10px">                                ) {
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                        $text = $this->get_modifiable_text();
</del><span class="cx" style="display: block; padding: 0 10px">                                         /*
</span><span class="cx" style="display: block; padding: 0 10px">                                         * If the text is empty after processing HTML entities and stripping
</span><span class="cx" style="display: block; padding: 0 10px">                                         * U+0000 NULL bytes then ignore the token.
</span><span class="cx" style="display: block; padding: 0 10px">                                         */
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                        if ( '' === $text ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                                 if ( parent::TEXT_IS_NULL_SEQUENCE === $this->text_node_classification ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                                                 return $this->step();
</span><span class="cx" style="display: block; padding: 0 10px">                                        }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -2857,7 +2841,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                                         *
</span><span class="cx" style="display: block; padding: 0 10px">                                         * @see https://html.spec.whatwg.org/#parsing-main-intabletext
</span><span class="cx" style="display: block; padding: 0 10px">                                         */
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                        if ( strlen( $text ) === strspn( $text, " \t\f\r\n" ) ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                                 if ( parent::TEXT_IS_WHITESPACE === $this->text_node_classification ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                                                 $this->insert_html_element( $this->state->current_token );
</span><span class="cx" style="display: block; padding: 0 10px">                                                return true;
</span><span class="cx" style="display: block; padding: 0 10px">                                        }
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -3177,16 +3161,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                         * > U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE
</span><span class="cx" style="display: block; padding: 0 10px">                         */
</span><span class="cx" style="display: block; padding: 0 10px">                        case '#text':
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $text = $this->get_modifiable_text();
-                               if ( '' === $text ) {
-                                       /*
-                                        * If the text is empty after processing HTML entities and stripping
-                                        * U+0000 NULL bytes then ignore the token.
-                                        */
-                                       return $this->step();
-                               }
-
-                               if ( strlen( $text ) === strspn( $text, " \t\n\f\r" ) ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         if ( parent::TEXT_IS_WHITESPACE === $this->text_node_classification ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                                         // Insert the character.
</span><span class="cx" style="display: block; padding: 0 10px">                                        $this->insert_html_element( $this->state->current_token );
</span><span class="cx" style="display: block; padding: 0 10px">                                        return true;
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -3609,8 +3584,6 @@
</span><span class="cx" style="display: block; padding: 0 10px">                         * > Any other character token
</span><span class="cx" style="display: block; padding: 0 10px">                         */
</span><span class="cx" style="display: block; padding: 0 10px">                        case '#text':
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $current_token = $this->bookmarks[ $this->state->current_token->bookmark_name ];
-
</del><span class="cx" style="display: block; padding: 0 10px">                                 /*
</span><span class="cx" style="display: block; padding: 0 10px">                                 * > A character token that is U+0000 NULL
</span><span class="cx" style="display: block; padding: 0 10px">                                 *
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -3617,11 +3590,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                                 * If a text node only comprises null bytes then it should be
</span><span class="cx" style="display: block; padding: 0 10px">                                 * entirely ignored and should not return to calling code.
</span><span class="cx" style="display: block; padding: 0 10px">                                 */
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                if (
-                                       1 <= $current_token->length &&
-                                       "\x00" === $this->html[ $current_token->start ] &&
-                                       strspn( $this->html, "\x00", $current_token->start, $current_token->length ) === $current_token->length
-                               ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         if ( parent::TEXT_IS_NULL_SEQUENCE === $this->text_node_classification ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                                         // Parse error: ignore the token.
</span><span class="cx" style="display: block; padding: 0 10px">                                        return $this->step();
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -3986,8 +3955,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                         * > Process the token using the rules for the "in body" insertion mode.
</span><span class="cx" style="display: block; padding: 0 10px">                         */
</span><span class="cx" style="display: block; padding: 0 10px">                        case '#text':
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $text = $this->get_modifiable_text();
-                               if ( strlen( $text ) === strspn( $text, " \t\n\f\r" ) ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         if ( parent::TEXT_IS_WHITESPACE === $this->text_node_classification ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                                         return $this->step_in_body();
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="cx" style="display: block; padding: 0 10px">                                goto after_body_anything_else;
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -4072,9 +4040,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                         * them under HTML. This is not supported at this time.
</span><span class="cx" style="display: block; padding: 0 10px">                         */
</span><span class="cx" style="display: block; padding: 0 10px">                        case '#text':
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $text = $this->get_modifiable_text();
-                               $text = $this->get_modifiable_text();
-                               if ( strlen( $text ) === strspn( $text, " \t\n\f\r" ) ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         if ( parent::TEXT_IS_WHITESPACE === $this->text_node_classification ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                                         return $this->step_in_body();
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="cx" style="display: block; padding: 0 10px">                                $this->bail( 'Non-whitespace characters cannot be handled in frameset.' );
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -4193,8 +4159,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                         * them under HTML. This is not supported at this time.
</span><span class="cx" style="display: block; padding: 0 10px">                         */
</span><span class="cx" style="display: block; padding: 0 10px">                        case '#text':
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $text = $this->get_modifiable_text();
-                               if ( strlen( $text ) === strspn( $text, " \t\n\f\r" ) ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         if ( parent::TEXT_IS_WHITESPACE === $this->text_node_classification ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                                         return $this->step_in_body();
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="cx" style="display: block; padding: 0 10px">                                $this->bail( 'Non-whitespace characters cannot be handled in after frameset' );
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -4288,8 +4253,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                         * > Process the token using the rules for the "in body" insertion mode.
</span><span class="cx" style="display: block; padding: 0 10px">                         */
</span><span class="cx" style="display: block; padding: 0 10px">                        case '#text':
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $text = $this->get_modifiable_text();
-                               if ( strlen( $text ) === strspn( $text, " \t\n\f\r" ) ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         if ( parent::TEXT_IS_WHITESPACE === $this->text_node_classification ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                                         return $this->step_in_body();
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="cx" style="display: block; padding: 0 10px">                                goto after_after_body_anything_else;
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -4355,8 +4319,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                         * them under HTML. This is not supported at this time.
</span><span class="cx" style="display: block; padding: 0 10px">                         */
</span><span class="cx" style="display: block; padding: 0 10px">                        case '#text':
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $text = $this->get_modifiable_text();
-                               if ( strlen( $text ) === strspn( $text, " \t\n\f\r" ) ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         if ( parent::TEXT_IS_WHITESPACE === $this->text_node_classification ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                                         return $this->step_in_body();
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="cx" style="display: block; padding: 0 10px">                                $this->bail( 'Non-whitespace characters cannot be handled in after after frameset.' );
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -4412,6 +4375,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">                switch ( $op ) {
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                        case '#cdata-section':
</ins><span class="cx" style="display: block; padding: 0 10px">                         case '#text':
</span><span class="cx" style="display: block; padding: 0 10px">                                /*
</span><span class="cx" style="display: block; padding: 0 10px">                                 * > A character token that is U+0000 NULL
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -4424,8 +4388,7 @@
</span><span class="cx" style="display: block; padding: 0 10px">                                 * It is probably inter-element whitespace, but it may also
</span><span class="cx" style="display: block; padding: 0 10px">                                 * contain character references which decode only to whitespace.
</span><span class="cx" style="display: block; padding: 0 10px">                                 */
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                                $text = $this->get_modifiable_text();
-                               if ( strlen( $text ) !== strspn( $text, " \t\n\f\r" ) ) {
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+                         if ( parent::TEXT_IS_GENERIC === $this->text_node_classification ) {
</ins><span class="cx" style="display: block; padding: 0 10px">                                         $this->state->frameset_ok = false;
</span><span class="cx" style="display: block; padding: 0 10px">                                }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -4435,7 +4398,6 @@
</span><span class="cx" style="display: block; padding: 0 10px">                        /*
</span><span class="cx" style="display: block; padding: 0 10px">                         * > A comment token
</span><span class="cx" style="display: block; padding: 0 10px">                         */
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                        case '#cdata-section':
</del><span class="cx" style="display: block; padding: 0 10px">                         case '#comment':
</span><span class="cx" style="display: block; padding: 0 10px">                        case '#funky-comment':
</span><span class="cx" style="display: block; padding: 0 10px">                        case '#presumptuous-tag':
</span></span></pre></div>
<a id="trunksrcwpincludeshtmlapiclasswphtmltagprocessorphp"></a>
<div class="modfile"><h4 style="background-color: #eee; color: inherit; margin: 1em 0; padding: 1.3em; font-size: 115%">Modified: trunk/src/wp-includes/html-api/class-wp-html-tag-processor.php</h4>
<pre class="diff"><span>
<span class="info" style="display: block; padding: 0 10px; color: #888">--- trunk/src/wp-includes/html-api/class-wp-html-tag-processor.php    2024-09-02 22:26:22 UTC (rev 58969)
+++ trunk/src/wp-includes/html-api/class-wp-html-tag-processor.php      2024-09-02 23:19:08 UTC (rev 58970)
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -542,6 +542,20 @@
</span><span class="cx" style="display: block; padding: 0 10px">        protected $comment_type = null;
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">        /**
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         * What kind of text the matched text node represents, if it was subdivided.
+        *
+        * @see self::TEXT_IS_NULL_SEQUENCE
+        * @see self::TEXT_IS_WHITESPACE
+        * @see self::TEXT_IS_GENERIC
+        * @see self::subdivide_text_appropriately
+        *
+        * @since 6.7.0
+        *
+        * @var string
+        */
+       protected $text_node_classification = self::TEXT_IS_GENERIC;
+
+       /**
</ins><span class="cx" style="display: block; padding: 0 10px">          * How many bytes from the original HTML document have been read and parsed.
</span><span class="cx" style="display: block; padding: 0 10px">         *
</span><span class="cx" style="display: block; padding: 0 10px">         * This value points to the latest byte offset in the input document which
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -2199,16 +2213,17 @@
</span><span class="cx" style="display: block; padding: 0 10px">                        unset( $this->lexical_updates[ $name ] );
</span><span class="cx" style="display: block; padding: 0 10px">                }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                $this->token_starts_at      = null;
-               $this->token_length         = null;
-               $this->tag_name_starts_at   = null;
-               $this->tag_name_length      = null;
-               $this->text_starts_at       = 0;
-               $this->text_length          = 0;
-               $this->is_closing_tag       = null;
-               $this->attributes           = array();
-               $this->comment_type         = null;
-               $this->duplicate_attributes = null;
</del><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         $this->token_starts_at          = null;
+               $this->token_length             = null;
+               $this->tag_name_starts_at       = null;
+               $this->tag_name_length          = null;
+               $this->text_starts_at           = 0;
+               $this->text_length              = 0;
+               $this->is_closing_tag           = null;
+               $this->attributes               = array();
+               $this->comment_type             = null;
+               $this->text_node_classification = self::TEXT_IS_GENERIC;
+               $this->duplicate_attributes     = null;
</ins><span class="cx" style="display: block; padding: 0 10px">         }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">        /**
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -3322,6 +3337,107 @@
</span><span class="cx" style="display: block; padding: 0 10px">        }
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span><span class="cx" style="display: block; padding: 0 10px">        /**
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+         * Subdivides a matched text node or CDATA text node, splitting NULL byte sequences
+        * and decoded whitespace as distinct prefixes.
+        *
+        * Note that once anything that's neither a NULL byte nor decoded whitespace is
+        * encountered, then the remainder of the text node is left intact as generic text.
+        *
+        *  - The HTML Processor uses this to apply distinct rules for different kinds of text.
+        *  - Inter-element whitespace can be detected and skipped with this method.
+        *
+        * Text nodes aren't eagerly subdivided because there's no need to split them unless
+        * decisions are being made on NULL byte sequences or whitespace-only text.
+        *
+        * Example:
+        *
+        *     $processor = new WP_HTML_Tag_Processor( "\x00Apples & Oranges" );
+        *     true  === $processor->next_token();                   // Text is "Apples & Oranges".
+        *     true  === $processor->subdivide_text_appropriately(); // Text is "".
+        *     true  === $processor->next_token();                   // Text is "Apples & Oranges".
+        *     false === $processor->subdivide_text_appropriately();
+        *
+        *     $processor = new WP_HTML_Tag_Processor( "&#x0A; \r\n\tMore" );
+        *     true  === $processor->next_token();                   // Text is "␤ ␤␉More".
+        *     true  === $processor->subdivide_text_appropriately(); // Text is "␤ ␤␉".
+        *     true  === $processor->next_token();                   // Text is "More".
+        *     false === $processor->subdivide_text_appropriately();
+        *
+        * @since 6.7.0
+        *
+        * @return bool Whether the text node was subdivided.
+        */
+       public function subdivide_text_appropriately(): bool {
+               $this->text_node_classification = self::TEXT_IS_GENERIC;
+
+               if ( self::STATE_TEXT_NODE === $this->parser_state ) {
+                       /*
+                        * NULL bytes are treated categorically differently from numeric character
+                        * references whose number is zero. `&#x00;` is not the same as `"\x00"`.
+                        */
+                       $leading_nulls = strspn( $this->html, "\x00", $this->text_starts_at, $this->text_length );
+                       if ( $leading_nulls > 0 ) {
+                               $this->token_length             = $leading_nulls;
+                               $this->text_length              = $leading_nulls;
+                               $this->bytes_already_parsed     = $this->token_starts_at + $leading_nulls;
+                               $this->text_node_classification = self::TEXT_IS_NULL_SEQUENCE;
+                               return true;
+                       }
+
+                       /*
+                        * Start a decoding loop to determine the point at which the
+                        * text subdivides. This entails raw whitespace bytes and any
+                        * character reference that decodes to the same.
+                        */
+                       $at  = $this->text_starts_at;
+                       $end = $this->text_starts_at + $this->text_length;
+                       while ( $at < $end ) {
+                               $skipped = strspn( $this->html, " \t\f\r\n", $at, $end - $at );
+                               $at     += $skipped;
+
+                               if ( $at < $end && '&' === $this->html[ $at ] ) {
+                                       $matched_byte_length = null;
+                                       $replacement         = WP_HTML_Decoder::read_character_reference( 'data', $this->html, $at, $matched_byte_length );
+                                       if ( isset( $replacement ) && 1 === strspn( $replacement, " \t\f\r\n" ) ) {
+                                               $at += $matched_byte_length;
+                                               continue;
+                                       }
+                               }
+
+                               break;
+                       }
+
+                       if ( $at > $this->text_starts_at ) {
+                               $new_length                     = $at - $this->text_starts_at;
+                               $this->text_length              = $new_length;
+                               $this->token_length             = $new_length;
+                               $this->bytes_already_parsed     = $at;
+                               $this->text_node_classification = self::TEXT_IS_WHITESPACE;
+                               return true;
+                       }
+
+                       return false;
+               }
+
+               // Unlike text nodes, there are no character references within CDATA sections.
+               if ( self::STATE_CDATA_NODE === $this->parser_state ) {
+                       $leading_nulls = strspn( $this->html, "\x00", $this->text_starts_at, $this->text_length );
+                       if ( $leading_nulls === $this->text_length ) {
+                               $this->text_node_classification = self::TEXT_IS_NULL_SEQUENCE;
+                               return true;
+                       }
+
+                       $leading_ws = strspn( $this->html, " \t\f\r\n", $this->text_starts_at, $this->text_length );
+                       if ( $leading_ws === $this->text_length ) {
+                               $this->text_node_classification = self::TEXT_IS_WHITESPACE;
+                               return true;
+                       }
+               }
+
+               return false;
+       }
+
+       /**
</ins><span class="cx" style="display: block; padding: 0 10px">          * Returns the modifiable text for a matched token, or an empty string.
</span><span class="cx" style="display: block; padding: 0 10px">         *
</span><span class="cx" style="display: block; padding: 0 10px">         * Modifiable text is text content that may be read and changed without
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -4248,4 +4364,35 @@
</span><span class="cx" style="display: block; padding: 0 10px">         * @since 6.5.0
</span><span class="cx" style="display: block; padding: 0 10px">         */
</span><span class="cx" style="display: block; padding: 0 10px">        const COMMENT_AS_INVALID_HTML = 'COMMENT_AS_INVALID_HTML';
</span><ins style="background-color: #dfd; text-decoration:none; display:block; padding: 0 10px">+
+       /**
+        * Indicates that a span of text may contain any combination of significant
+        * kinds of characters: NULL bytes, whitespace, and others.
+        *
+        * @see self::$text_node_classification
+        * @see self::subdivide_text_appropriately
+        *
+        * @since 6.7.0
+        */
+       const TEXT_IS_GENERIC = 'TEXT_IS_GENERIC';
+
+       /**
+        * Indicates that a span of text comprises a sequence only of NULL bytes.
+        *
+        * @see self::$text_node_classification
+        * @see self::subdivide_text_appropriately
+        *
+        * @since 6.7.0
+        */
+       const TEXT_IS_NULL_SEQUENCE = 'TEXT_IS_NULL_SEQUENCE';
+
+       /**
+        * Indicates that a span of decoded text comprises only whitespace.
+        *
+        * @see self::$text_node_classification
+        * @see self::subdivide_text_appropriately
+        *
+        * @since 6.7.0
+        */
+       const TEXT_IS_WHITESPACE = 'TEXT_IS_WHITESPACE';
</ins><span class="cx" style="display: block; padding: 0 10px"> }
</span></span></pre></div>
<a id="trunktestsphpunittestshtmlapiwpHtmlProcessorHtml5libphp"></a>
<div class="modfile"><h4 style="background-color: #eee; color: inherit; margin: 1em 0; padding: 1.3em; font-size: 115%">Modified: trunk/tests/phpunit/tests/html-api/wpHtmlProcessorHtml5lib.php</h4>
<pre class="diff"><span>
<span class="info" style="display: block; padding: 0 10px; color: #888">--- trunk/tests/phpunit/tests/html-api/wpHtmlProcessorHtml5lib.php    2024-09-02 22:26:22 UTC (rev 58969)
+++ trunk/tests/phpunit/tests/html-api/wpHtmlProcessorHtml5lib.php      2024-09-02 23:19:08 UTC (rev 58970)
</span><span class="lines" style="display: block; padding: 0 10px; color: #888">@@ -27,23 +27,17 @@
</span><span class="cx" style="display: block; padding: 0 10px">        const SKIP_TESTS = array(
</span><span class="cx" style="display: block; padding: 0 10px">                'comments01/line0155'    => 'Unimplemented: Need to access raw comment text on non-normative comments.',
</span><span class="cx" style="display: block; padding: 0 10px">                'comments01/line0169'    => 'Unimplemented: Need to access raw comment text on non-normative comments.',
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                'doctype01/line0380'     => 'Bug: Mixed whitespace, non-whitespace text in head not split correctly',
</del><span class="cx" style="display: block; padding: 0 10px">                 'html5test-com/line0129' => 'Unimplemented: Need to access raw comment text on non-normative comments.',
</span><span class="cx" style="display: block; padding: 0 10px">                'noscript01/line0014'    => 'Unimplemented: This parser does not add missing attributes to existing HTML or BODY tags.',
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                'tests1/line0692'        => 'Bug: Mixed whitespace, non-whitespace text in head not split correctly',
</del><span class="cx" style="display: block; padding: 0 10px">                 'tests14/line0022'       => 'Unimplemented: This parser does not add missing attributes to existing HTML or BODY tags.',
</span><span class="cx" style="display: block; padding: 0 10px">                'tests14/line0055'       => 'Unimplemented: This parser does not add missing attributes to existing HTML or BODY tags.',
</span><span class="cx" style="display: block; padding: 0 10px">                'tests19/line0488'       => 'Unimplemented: This parser does not add missing attributes to existing HTML or BODY tags.',
</span><span class="cx" style="display: block; padding: 0 10px">                'tests19/line0500'       => 'Unimplemented: This parser does not add missing attributes to existing HTML or BODY tags.',
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                'tests19/line0965'       => 'Bug: Mixed whitespace, non-whitespace text in head not split correctly.',
</del><span class="cx" style="display: block; padding: 0 10px">                 'tests19/line1079'       => 'Unimplemented: This parser does not add missing attributes to existing HTML or BODY tags.',
</span><span class="cx" style="display: block; padding: 0 10px">                'tests2/line0207'        => 'Unimplemented: This parser does not add missing attributes to existing HTML or BODY tags.',
</span><span class="cx" style="display: block; padding: 0 10px">                'tests2/line0686'        => 'Unimplemented: This parser does not add missing attributes to existing HTML or BODY tags.',
</span><span class="cx" style="display: block; padding: 0 10px">                'tests2/line0697'        => 'Unimplemented: This parser does not add missing attributes to existing HTML or BODY tags.',
</span><span class="cx" style="display: block; padding: 0 10px">                'tests2/line0709'        => 'Unimplemented: This parser does not add missing attributes to existing HTML or BODY tags.',
</span><del style="background-color: #fdd; text-decoration:none; display:block; padding: 0 10px">-                'tests5/line0013'        => 'Bug: Mixed whitespace, non-whitespace text in head not split correctly.',
-               'tests5/line0077'        => 'Bug: Mixed whitespace, non-whitespace text in head not split correctly.',
-               'tests5/line0091'        => 'Bug: Mixed whitespace, non-whitespace text in head not split correctly',
</del><span class="cx" style="display: block; padding: 0 10px">                 'webkit01/line0231'      => 'Unimplemented: This parser does not add missing attributes to existing HTML or BODY tags.',
</span><span class="cx" style="display: block; padding: 0 10px">        );
</span><span class="cx" style="display: block; padding: 0 10px"> 
</span></span></pre>
</div>
</div>

</body>
</html>