[wp-trac] [WordPress Trac] #62091: XML API: Produce XML Serialization of HTML (XHTML)

WordPress Trac noreply at wordpress.org
Fri Sep 27 00:49:01 UTC 2024


#62091: XML API: Produce XML Serialization of HTML (XHTML)
-----------------------------+-----------------------------
 Reporter:  dmsnell          |       Owner:  (none)
     Type:  feature request  |      Status:  new
 Priority:  normal           |   Milestone:  Future Release
Component:  HTML API         |     Version:  trunk
 Severity:  normal           |  Resolution:
 Keywords:  has-patch        |     Focuses:
-----------------------------+-----------------------------
Description changed by dmsnell:

Old description:

> Even though XML cannot represent all possible HTML documents, and even
> though it's dangerous to send XHTML content generally, there are
> extremely rare cases where it's useful to directly embed an HTML document
> into an existing XML document, if the given document //can// be expressed
> in XML.
>
> === What is required to transform when converting HTML to XML? ===
>
>  * HTML void elements like `<img>` should adopt the self-closing flag to
> become `<img />`
>  * HTML text should be decoded and then only `<`, `>`, `&`, `"`, and `'`
> ought to be re-encoded.
>  * Namespace transitions should involve changes to the default namespace.
>     * When entering a foreign element (`SVG` and `MATH`).
>     * When returning to HTML from a foreign element.
>     * When entering HTML integration points, such as `FOREIGNOBJECT` and
> `ANNOTATION-XML` with the proper attribute.
>  * Something has to be done about un-representable characters.
>     * Invalid UTF-8 bytes.
>     * Unicode non-characters and other disallowed characters.
>  * HTML documents which cannot be represented in XML should result in
> rejection - cannot serialize.
>  * The HTML doctype declaration probably needs to be removed.
>
> === Design ===
>
> With the introduction of `WP_HTML_Processor::serialize()` in #62036, an
> XML serialization might appear naturally as
> `WP_HTML_Processor::serialize_to_xml()`. When parsing as a fragment, the
> output may be an XML fragment, while a full parser would produce a valid
> XHTML document including the XML declaration.
>
> ----
>
> Please share your thoughts if you know of other transformations that need
> to occur.
>
> ----
>
> XML and HTML are divergent languages. You probably don't want XHTML. It's
> dangerous.

New description:

 Even though XML cannot represent all possible HTML documents, and even
 though it's dangerous to send XHTML content generally, there are extremely
 rare cases where it's useful to directly embed an HTML document into an
 existing XML document, if the given document //can// be expressed in XML.

 === What is required to transform when converting HTML to XML? ===

  * HTML void elements like `<img>` should adopt the self-closing flag to
 become `<img />`
  * HTML text should be decoded and then only `<`, `>`, `&`, `"`, and `'`
 ought to be re-encoded.
  * Namespace transitions should involve changes to the default namespace.
     * When entering a foreign element (`SVG` and `MATH`).
     * When returning to HTML from a foreign element.
     * When entering HTML integration points, such as `FOREIGNOBJECT` and
 `ANNOTATION-XML` with the proper attribute.
     * Containing element needs namespace prefix on tag name, e.g.
 `<svg:svg>` or `<svg:foreignElement>`, and then we can update the default
 namespace on that element, but because the default namespace doesn't apply
 to attributes, and because namespaced attributes are different than non-
 namespaced attributes, we must leave the attributes un-namespaced.
  * Something has to be done about un-representable characters.
     * Invalid UTF-8 bytes.
     * Unicode non-characters and other disallowed characters.
  * HTML documents which cannot be represented in XML should result in
 rejection - cannot serialize.
  * The HTML doctype declaration probably needs to be removed.

 === Design ===

 With the introduction of `WP_HTML_Processor::serialize()` in #62036, an
 XML serialization might appear naturally as
 `WP_HTML_Processor::serialize_to_xml()`. When parsing as a fragment, the
 output may be an XML fragment, while a full parser would produce a valid
 XHTML document including the XML declaration.

 ----

 Please share your thoughts if you know of other transformations that need
 to occur.

 ----

 XML and HTML are divergent languages. You probably don't want XHTML. It's
 dangerous.

--

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/62091#comment:4>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform


More information about the wp-trac mailing list