[wp-trac] [WordPress Trac] #62396: HTML API: serialize should include doctype when present

WordPress Trac noreply at wordpress.org
Wed Nov 13 16:29:29 UTC 2024


#62396: HTML API: serialize should include doctype when present
-------------------------------------------------+-------------------------
 Reporter:  jonsurrell                           |       Owner:  jonsurrell
     Type:  defect (bug)                         |      Status:  closed
 Priority:  normal                               |   Milestone:  6.7.1
Component:  HTML API                             |     Version:  6.7
 Severity:  normal                               |  Resolution:  fixed
 Keywords:  has-patch has-unit-tests             |     Focuses:
  dev-feedback fixed-major                       |
-------------------------------------------------+-------------------------

Comment (by dmsnell):

 Thanks for addressing this point, but I would ask we reconsider this
 change.

 First, thanks for noticing the typo from my original code. I apparently
 used the token name `html` in the switch when passing the token type
 `#doctype`. This fix should stay.

 I believe the change to recreate the original doctype might be a mistake.
 It recreates the input document more faithfully, but that's not the goal
 of `serialize()`, which is to //fully normalize// the input into a
 consistent and predictable form.

 {{{#!php
 <?php
 $p = WP_HTML_Processor::create_full_parser( '<!DOCTYPE dorktype><p>hi<!DOCTYPE html5>' );
 echo $p->serialize();
 // <!DOCTYPE dorktype><html><head></head><body><p>hi</p></body></html>
 }}}

 I’m in the anti-dork-type camp here. Initially I thought this was very
 messy and that, because of the way the HTML Processor understands HTML, we
 would always want to return standards-mode HTML. This is why it formerly
 only returned `<!DOCTYPE html>`. A little investigation reveals a more
 nuanced need, but one for which I think we have precedent.

 //We ought to preserve the mode of the input document// but I don't think
 we should preserve the PUBLIC and SYSTEM identifiers. That is, **only
 return one of two predefined DOCTYPE declarations**.

 If we have a standards document input, we return `<!DOCTYPE html>`. If we
 get a quirks-mode document input, we return something like `<!DOCTYPE html
 PUBLIC "-//WordPress/HTML/quirks-mode">` to fully identify the document as
 such.
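
 As a rough sketch of that idea: the helper below maps a compatibility mode
 to one of two fixed DOCTYPE strings. The function name
 `normalized_doctype()` and the `'quirks'` / `'no-quirks'` mode strings are
 hypothetical and not part of the HTML API, and the quirks-mode public
 identifier is only the placeholder floated above. Whether parsers would
 actually treat that identifier as quirks mode is not verified here.

```php
<?php
/**
 * Hypothetical helper, not part of the HTML API: picks one of two
 * predefined DOCTYPE declarations based on the document's mode.
 *
 * Assumption: any mode other than 'quirks' is serialized as standards mode.
 */
function normalized_doctype( string $compat_mode ): string {
	if ( 'quirks' === $compat_mode ) {
		// Placeholder public identifier taken from the proposal above.
		return '<!DOCTYPE html PUBLIC "-//WordPress/HTML/quirks-mode">';
	}

	// The original PUBLIC and SYSTEM identifiers are intentionally dropped.
	return '<!DOCTYPE html>';
}

echo normalized_doctype( 'no-quirks' ), "\n";
echo normalized_doctype( 'quirks' ), "\n";
```

 The point of the sketch is that the output depends only on the mode, so
 two inputs with different identifiers but the same mode serialize to the
 same DOCTYPE.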

 ----

 For quirks-mode input I was thinking of the TABLE-closes-a-P rule. If we
 print out that serialized document as standards-compliant then we end up
 reshaping the document structure, and we produce an extra empty P element
 upon re-parsing. So we have to stay in quirks mode. The more obvious thing
 I should have considered is the ASCII case sensitivity of CSS class names,
 which differs between the modes.

 Regardless, what do you think about printing only a pre-defined DOCTYPE
 for quirks mode? We could reuse one of the W3C ones, but I think creating
 our own would also be nice and clearly communicate what's going on.

 > Fun fact: the `-//` prefix in the public identifier indicates that what
 follows is an unregistered owner. That is, since this isn't referring to
 an ISO standards document, it’s unregistered. Otherwise it would be `+//`.
 SGML lore 🙂

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/62396#comment:7>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

