[wp-trac] [WordPress Trac] #64607: HTML API: Normalize may remove significant text from PRE, TEXTAREA content

Thu Feb 5 16:29:32 UTC 2026

#64607: HTML API: Normalize may remove significant text from PRE, TEXTAREA content
--------------------------+-----------------------------
 Reporter:  jonsurrell    |      Owner:  (none)
     Type:  defect (bug)  |     Status:  new
 Priority:  normal        |  Milestone:  Awaiting Review
Component:  HTML API      |    Version:  6.7
 Severity:  normal        |   Keywords:
  Focuses:                |
--------------------------+-----------------------------
 `WP_HTML_Processor::normalize()` removes a single leading newline
 (`U+000A`) from `TEXTAREA`, `PRE`, and `LISTING` content. It is correct to
 recognize that a leading newline ''is ignored'' inside of these elements.
 However, an HTML-to-HTML normalization function must preserve the semantic
 content, and there are cases where stripping a leading newline produces
 HTML that parses to different content.

 Consider:

 {{{#!xml
 <!-- original -->
 <textarea>
 </textarea>
 <!-- Normalized -->
 <textarea></textarea>
 }}}

 In this case, stripping the leading newline is harmless because the
 newline is ignored by any spec-compliant HTML parser.

 However, if there are multiple leading newlines, the result is different:

 {{{#!xml
 <!-- original -->
 <textarea>

 </textarea>
 <!-- Normalized -->
 <textarea>
 </textarea>
 }}}

 In this case, the **original** contains a newline when parsed. When the
 **normalized** version is parsed, that newline is stripped and the
 element is now ''empty'': it contains no newlines. Normalization has
 failed because the result is not semantically equivalent to the input.
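
 The parser rule behind this can be modelled in a few lines. The following
 is a minimal sketch, not the HTML API's implementation:
 `sketch_parse_inner_text()` and `sketch_naive_round_trip()` are
 hypothetical helpers that model only the leading-newline rule and ignore
 everything else (character references, CR/LF handling, and so on).

 {{{#!php
 <?php
 /**
  * Hypothetical model of the parser rule: a single U+000A immediately after
  * a TEXTAREA, PRE, or LISTING start tag is ignored.
  */
 function sketch_parse_inner_text( string $inner_html ): string {
 	// strncmp() with length 1 also handles the empty string safely.
 	if ( 0 === strncmp( $inner_html, "\n", 1 ) ) {
 		return (string) substr( $inner_html, 1 );
 	}
 	return $inner_html;
 }

 /**
  * Hypothetical model of one parse-then-serialize pass in which the
  * serializer writes the parsed text back unchanged.
  */
 function sketch_naive_round_trip( string $inner_html ): string {
 	return sketch_parse_inner_text( $inner_html );
 }

 // One leading newline: the parsed text is empty before and after the pass.
 var_dump(
 	sketch_parse_inner_text( "\n" ) ===
 	sketch_parse_inner_text( sketch_naive_round_trip( "\n" ) )
 ); // bool(true)

 // Two leading newlines: the parsed text is "\n" before the pass, empty after.
 var_dump(
 	sketch_parse_inner_text( "\n\n" ) ===
 	sketch_parse_inner_text( sketch_naive_round_trip( "\n\n" ) )
 ); // bool(false)
 }}}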

 This can be observed by repeatedly calling `::normalize()`. On each call,
 a leading newline is stripped until none remain. Ideally, `::normalize()`
 would be idempotent and calling `::normalize()` on already-normalized HTML
 would result in no changes.

 {{{#!php
 <?php
 $html = <<<HTML
 <textarea>

 </textarea>
 HTML;

 echo "\n{$html}\n";
 $html = WP_HTML_Processor::normalize( $html );
 echo "\n{$html}\n";
 $html = WP_HTML_Processor::normalize( $html );
 echo "\n{$html}\n";
 $html = WP_HTML_Processor::normalize( $html );
 echo "\n{$html}\n";
 }}}

 Prints the following. Notice that normalization consumes newlines until
 they're exhausted:

 {{{
 <textarea>

 </textarea>

 <textarea>
 </textarea>

 <textarea></textarea>

 <textarea></textarea>
 }}}
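
 The same check can be written as a small assertion helper. This is a
 sketch only: `sketch_is_idempotent()` is a hypothetical name, and the
 closure below is a stand-in that imitates the reported behavior for
 `TEXTAREA` so the example runs without WordPress loaded.

 {{{#!php
 <?php
 /**
  * Hypothetical helper: reports whether applying $normalize a second time
  * leaves the first result unchanged.
  *
  * @param callable $normalize Takes an HTML string, returns a string (or null on failure).
  * @param string   $html      Input HTML.
  * @return bool True when the second pass is a no-op.
  */
 function sketch_is_idempotent( callable $normalize, string $html ): bool {
 	$once = $normalize( $html );
 	if ( ! is_string( $once ) ) {
 		return false; // Normalization failed, so there is nothing to compare.
 	}
 	return $once === $normalize( $once );
 }

 // Stand-in for the reported behavior: drops one newline after <textarea>.
 $stand_in = function ( string $html ) {
 	return preg_replace( '~(<textarea>)\n~', '$1', $html );
 };

 var_dump( sketch_is_idempotent( $stand_in, "<textarea>\n\n</textarea>" ) ); // bool(false)
 var_dump( sketch_is_idempotent( $stand_in, "<textarea></textarea>" ) );     // bool(true)

 // With WordPress loaded, the real function can be passed instead:
 // sketch_is_idempotent( array( 'WP_HTML_Processor', 'normalize' ), $html );
 }}}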

 This behavior ''only'' affects the leading newline inside elements with
 special leading newline behavior: `TEXTAREA`, `PRE`, and `LISTING`.

 [https://html.spec.whatwg.org/multipage/parsing.html#serialising-html-fragments This note from the HTML standard is interesting:]

 > For historical reasons, this algorithm does not round-trip an initial
 U+000A (LF) character in `pre`, `textarea`, or `listing` elements, even
 though (in the first two cases) the markup being round-tripped can be
 conforming. The HTML parser will drop such a character during parsing, but
 this algorithm does not serialize an extra U+000A (LF) character.

 This is one case where HTML-to-HTML tooling like the HTML API can do
 better than the standard by ensuring that HTML content is preserved.
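
 One way such tooling could preserve the content is to serialize an extra
 `U+000A` whenever the text of one of these elements itself begins with
 `U+000A`, so that the newline the parser drops is the added one. The
 sketch below is an assumption about a possible approach, not a description
 of how the HTML API behaves or will behave.
 `sketch_serialize_text_element()` is a hypothetical helper and handles
 only an element whose content is plain text.

 {{{#!php
 <?php
 /**
  * Hypothetical serializer for an element containing only text.
  *
  * @param string $tag_name Lowercase tag name, e.g. "textarea".
  * @param string $text     Parsed (decoded) text content of the element.
  * @return string Serialized HTML.
  */
 function sketch_serialize_text_element( string $tag_name, string $text ): string {
 	$drops_leading_newline = in_array( $tag_name, array( 'pre', 'textarea', 'listing' ), true );

 	// Escape "&", "<", and ">" so the text cannot be read as markup.
 	$html = htmlspecialchars( $text, ENT_NOQUOTES | ENT_HTML5, 'UTF-8' );

 	// The parser ignores one leading newline in these elements, so emit a
 	// sacrificial one whenever the real content starts with a newline.
 	if ( $drops_leading_newline && 0 === strncmp( $text, "\n", 1 ) ) {
 		$html = "\n" . $html;
 	}

 	return "<{$tag_name}>{$html}</{$tag_name}>";
 }

 // Parsed text "\n" is written with two newlines, so it parses back to "\n".
 var_dump( sketch_serialize_text_element( 'textarea', "\n" ) === "<textarea>\n\n</textarea>" ); // bool(true)

 // Empty text needs no extra newline.
 var_dump( sketch_serialize_text_element( 'textarea', '' ) === '<textarea></textarea>' ); // bool(true)
 }}}

 Under this scheme `<textarea>\n\n</textarea>` is a fixed point: it parses
 to a single newline and serializes back to the same markup.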

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/64607>