[wp-trac] [WordPress Trac] #64607: HTML API: Normalize may remove significant text from PRE, TEXTAREA content
WordPress Trac
noreply at wordpress.org
Thu Feb 5 16:29:32 UTC 2026
#64607: HTML API: Normalize may remove significant text from PRE, TEXTAREA content
--------------------------+-----------------------------
Reporter: jonsurrell | Owner: (none)
Type: defect (bug) | Status: new
Priority: normal | Milestone: Awaiting Review
Component: HTML API | Version: 6.7
Severity: normal | Keywords:
Focuses: |
--------------------------+-----------------------------
`WP_HTML_Processor::normalize()` will remove a single leading newline
(`U+000A`) from `TEXTAREA`, `PRE`, and `LISTING` content. It is correct to
recognize that a leading newline ''is ignored'' inside of these elements.
However, an HTML-to-HTML normalization function must preserve the semantic
content. There are cases where stripping a leading newline results in
HTML that parses differently.
Consider:
{{{#!xml
<!-- original -->
<textarea>
</textarea>
<!-- normalized -->
<textarea></textarea>
}}}
In this case, stripping the leading newline is harmless because the
newline is ignored by any spec-compliant HTML parser.
However, if there are multiple leading newlines, the result is different:
{{{#!xml
<!-- original -->
<textarea>

</textarea>
<!-- normalized -->
<textarea>
</textarea>
}}}
In this case, the **original** contains a newline when parsed. However,
when the **normalized** version is parsed, its only newline is ignored and
the element is ''empty'': it contains no newlines. Normalization has
failed because the result is not semantically equivalent to the input.
This can be observed by repeatedly calling `::normalize()`. On each call,
a leading newline is stripped until none remain. Ideally, `::normalize()`
would be idempotent and calling `::normalize()` on already-normalized HTML
would result in no changes.
{{{#!php
<?php
$html = <<<HTML
<textarea>

</textarea>
HTML;
echo "\n{$html}\n";
$html = WP_HTML_Processor::normalize( $html );
echo "\n{$html}\n";
$html = WP_HTML_Processor::normalize( $html );
echo "\n{$html}\n";
$html = WP_HTML_Processor::normalize( $html );
echo "\n{$html}\n";
}}}
Prints the following. Notice that normalization consumes newlines until
they're exhausted:
{{{
<textarea>

</textarea>

<textarea>
</textarea>

<textarea></textarea>

<textarea></textarea>
}}}
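The repeated calls above can be wrapped in a small fixed-point check. This
is a minimal sketch: `count_normalize_passes()` is a hypothetical helper,
not part of WordPress, and the usage at the end assumes WordPress is loaded
so that `WP_HTML_Processor` exists.
{{{#!php
<?php
/**
 * Hypothetical helper: applies $normalize repeatedly and returns how many
 * passes changed the HTML before it stopped changing.
 */
function count_normalize_passes( callable $normalize, string $html, int $max_passes = 10 ): int {
	$passes = 0;
	while ( $passes < $max_passes ) {
		$next = $normalize( $html );
		// Stop on failure (null) or once a fixed point is reached.
		if ( null === $next || $next === $html ) {
			break;
		}
		$html = $next;
		++$passes;
	}
	return $passes;
}

// Only runs when WordPress is loaded.
if ( class_exists( 'WP_HTML_Processor' ) ) {
	$passes = count_normalize_passes(
		array( 'WP_HTML_Processor', 'normalize' ),
		"<textarea>\n\n</textarea>"
	);
	echo "Passes that changed the HTML: {$passes}\n";
}
}}}
An idempotent normalizer would report at most one pass for any input. Based
on the output above, the two-newline input would be expected to report two.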
This behavior ''only'' affects the leading newline inside elements with
special leading newline behavior: `TEXTAREA`, `PRE`, and `LISTING`.
[https://html.spec.whatwg.org/multipage/parsing.html#serialising-html-fragments This note from the HTML standard is interesting:]
> For historical reasons, this algorithm does not round-trip an initial
> U+000A (LF) character in `pre`, `textarea`, or `listing` elements, even
> though (in the first two cases) the markup being round-tripped can be
> conforming. The HTML parser will drop such a character during parsing, but
> this algorithm does not serialize an extra U+000A (LF) character.
This is one case where HTML-to-HTML tooling like the HTML API can do
better than the standard by ensuring that HTML content is preserved.
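One way a serializer could preserve such content is to emit an extra
newline whenever the parsed text itself begins with one. The following is
a sketch under stated assumptions, not the HTML API's implementation:
`serialize_newline_sensitive_element()` is a hypothetical helper, `$text`
is the content as seen ''after'' parsing, and escaping is simplified.
{{{#!php
<?php
/**
 * Hypothetical helper: serializes the text content of a PRE, TEXTAREA, or
 * LISTING element so that a leading newline survives the next parse.
 */
function serialize_newline_sensitive_element( string $tag_name, string $text ): string {
	$tag_name = strtolower( $tag_name );

	// Simplified escaping for illustration: encodes &, <, and >.
	$escaped = htmlspecialchars( $text, ENT_NOQUOTES | ENT_HTML5, 'UTF-8' );

	// A parser ignores one newline right after the start tag. If the
	// content begins with a newline, emit a sacrificial one before it.
	$guard = ( '' !== $text && "\n" === $text[0] ) ? "\n" : '';

	return "<{$tag_name}>{$guard}{$escaped}</{$tag_name}>";
}

// Content is a single newline: serialized with two so one remains.
var_dump( serialize_newline_sensitive_element( 'TEXTAREA', "\n" ) === "<textarea>\n\n</textarea>" ); // bool(true)

// Empty content: nothing to protect, nothing added.
var_dump( serialize_newline_sensitive_element( 'TEXTAREA', '' ) === '<textarea></textarea>' ); // bool(true)
}}}
Adding the extra newline only when the content starts with one keeps the
output minimal. Always emitting a newline after the start tag would also
round-trip, at the cost of changing markup that needs no protection.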
--
Ticket URL: <https://core.trac.wordpress.org/ticket/64607>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform