[wp-trac] [WordPress Trac] #62396: HTML API: serialize should include doctype when present
WordPress Trac
noreply at wordpress.org
Wed Nov 13 16:29:29 UTC 2024
#62396: HTML API: serialize should include doctype when present
-------------------------------------------------+-------------------------
Reporter: jonsurrell | Owner: jonsurrell
Type: defect (bug) | Status: closed
Priority: normal | Milestone: 6.7.1
Component: HTML API | Version: 6.7
Severity: normal | Resolution: fixed
Keywords: has-patch has-unit-tests dev-feedback fixed-major | Focuses:
-------------------------------------------------+-------------------------
Comment (by dmsnell):
Thanks for addressing this point, but I would ask we reconsider this
change.
First, thanks for noticing the typo from my original code. I apparently
used the token name `html` in the switch when passing the token type
`#doctype`. This fix should stay.
I believe the change to recreate the original doctype might be a mistake.
It more faithfully recreates the input document, but that's not the goal
of `serialize()`, which is to //fully normalize// the input into a form
that's going to be consistent and predictable.
{{{#!php
<?php
$p = WP_HTML_Processor::create_full_parser( '<!DOCTYPE dorktype><p>hi<!DOCTYPE html5>' );
echo $p->serialize();
// <!DOCTYPE dorktype><html><head></head><body><p>hi</p></body></html>
}}}
I’m in the anti-dork-type camp here. Initially I thought this was very
messy and that, because of the way the HTML Processor understands HTML, we
would always want to return standards-mode HTML; this is why it formerly
returned only `<!DOCTYPE html>`. A little investigation reveals a more
nuanced need, but one for which I think we have precedent.
//We ought to preserve the mode of the input document// but I don't think
we should preserve the PUBLIC and SYSTEM identifiers. That is, **only
return one of two predefined DOCTYPE declarations**.
If we have a standards document input, we return `<!DOCTYPE html>`. If we
get a quirks-mode document input, we return something like `<!DOCTYPE html
PUBLIC "-//WordPress/HTML/quirks-mode">` to fully identify the document as
such.
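
A minimal sketch of the two-DOCTYPE idea above, in plain PHP with no WordPress dependency. The function name `normalized_doctype()` and the mode strings `'quirks-mode'` / `'no-quirks-mode'` are hypothetical, chosen only for illustration; the quirks-mode identifier is the placeholder from this comment, not a settled value.

```php
<?php
/**
 * Hypothetical helper, not part of WordPress: map a document's
 * compatibility mode to one of two fixed DOCTYPE declarations,
 * discarding whatever PUBLIC/SYSTEM identifiers the input carried.
 */
function normalized_doctype( string $compat_mode ): string {
    if ( 'quirks-mode' === $compat_mode ) {
        // Placeholder identifier copied from the proposal above.
        return '<!DOCTYPE html PUBLIC "-//WordPress/HTML/quirks-mode">';
    }

    // Every other input is treated as a standards-mode document.
    return '<!DOCTYPE html>';
}

echo normalized_doctype( 'no-quirks-mode' ), "\n";
// <!DOCTYPE html>
echo normalized_doctype( 'quirks-mode' ), "\n";
// <!DOCTYPE html PUBLIC "-//WordPress/HTML/quirks-mode">
```

The point of the sketch is that the output depends only on the mode, so two inputs with different doctypes but the same mode serialize identically.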
----
For quirks-mode input I was thinking of the TABLE-closes-a-P rule. If we
print out that serialized document as standards-compliant then we reshape
the document structure and produce an extra empty P element upon
re-parsing. So we have to stay in quirks mode. The more obvious thing I
should have considered is the CSS names and ASCII case sensitivity.
Regardless, what do you think about printing only a pre-defined DOCTYPE
for quirks mode? We could reuse one of the W3C ones, but I think creating
our own would also be nice and clearly communicate what's going on.
> Fun fact: the `-//` prefix in the public identifier indicates that what
follows is an unregistered owner. That is, since the owner isn't formally
registered, the prefix is `-//`. A registered owner would use `+//`.
SGML lore 🙂
--
Ticket URL: <https://core.trac.wordpress.org/ticket/62396#comment:7>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform