[wp-trac] [WordPress Trac] #59883: Remove support for HTML4 and XHTML

WordPress Trac noreply at wordpress.org
Fri Nov 10 19:20:48 UTC 2023


#59883: Remove support for HTML4 and XHTML
-------------------------+-------------------------------------------------
 Reporter:  dmsnell      |      Owner:  (none)
     Type:  enhancement  |     Status:  new
 Priority:  normal       |  Milestone:  Awaiting Review
Component:  HTML API     |    Version:  trunk
 Severity:  normal       |   Keywords:  dev-feedback has-dev-note 2nd-
  Focuses:               |  opinion
-------------------------+-------------------------------------------------
 == Summary

 WordPress still officially supports HTML4 and XHTML, but the browsers it
 serves and the broader web effectively don't. Let's remove support so that
 we can modernize the code we write and simplify Core's HTML-handling
 functionality.

 == Background

 This came up recently in #58664 and in an exploration
 [https://github.com/WordPress/wordpress-develop/pull/5337 rewriting
 esc_attr()].

 In various places WordPress maintains the appearance of supporting HTML4,
 for example:

  - `wp_kses_named_entities()` rejects valid named character references
 such as those for `⇵` (U+21F5, e.g. `&DownArrowUpArrow;`) and in turn
 corrupts documents containing these entities.
  - script and style tags conditionally add `type` attributes that never
 need to be printed
  - widgets selectively render `<nav>` and strip tags out of the `$title`
 for a page, even though TITLE elements can contain no tags anyway. This
 corrupts the page title by removing text that WordPress thinks is a tag
 but isn't.
  - various places run `kses` as if serving XHTML, adding needless invalid
 syntax like the self-closing flag on void elements, e.g. `<img />`,
 `<br />`, `<meta />`

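 The first item above can be sketched in a few lines. This is not
 WordPress's code: `normalize_named_entities()` is a hypothetical stand-in
 for the allowlist behavior described here, and Python's HTML 4.01 table
 (`html.entities.name2codepoint`) is assumed as the allowlist, which will
 differ from the list Core actually ships.

```python
import html
import re
from html.entities import name2codepoint  # HTML 4.01 entity names only

# Assumption: the HTML 4.01 names stand in for WordPress's real allowlist.
HTML4_NAMES = frozenset(name2codepoint)

NAMED_REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def normalize_named_entities(text):
    """Escape the ampersand of any named reference not on the HTML4 allowlist."""

    def replace(match):
        name = match.group(1)
        if name in HTML4_NAMES:
            return match.group(0)  # known name: leave untouched
        return f"&amp;{name};"     # unknown name: the ampersand gets escaped

    return NAMED_REFERENCE.sub(replace, text)


original = "&lt;b&gt; &DownArrowUpArrow;"
filtered = normalize_named_entities(original)

# html.unescape() decodes with the HTML5 rules, as a browser would.
print(html.unescape(original))  # the arrow character survives
print(html.unescape(filtered))  # the reference is now literal text
```

 The HTML4 names pass through unchanged, while the HTML5-only name has its
 ampersand escaped, so a browser shows the literal text
 `&DownArrowUpArrow;` where the author intended `⇵`.
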
 The //appearance// of serving HTML4 or XHTML stems from the fact that it's
 very rare to serve actual XHTML content, and perhaps impossible to serve
 HTML4 content, to any supported browser or environment.

  - browsers ignore any `<?xml ?>` or `<!DOCTYPE>` declaration specifying
 HTML4 or XHTML; they interpret a page as HTML5 regardless. You can confirm
 this by visiting a page with the `&lang;` named character reference. If
 interpreted as HTML4 it will transform into the U+2329 `〈` code point,
 but if interpreted as HTML5 it will transform into the U+27E8 code point
 `⟨`.
  - the only way to serve a page as XHTML is to send the HTTP header
 `Content-type: application/xhtml+xml` or to serve the page with the `.xml`
 file extension in the URL (e.g. serve `index.xml` instead of `index.html`
 or `index.php` or `/index` or `/`). It's not enough to send a
 `<meta http-equiv="content-type" content="application/xhtml+xml">` tag;
 it //must// come through the HTTP headers.
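
 Both points can be checked outside a browser. The sketch below uses
 Python's standard library, which happens to carry both tables:
 `html.entities.name2codepoint` holds the HTML 4.01 definitions and
 `html.unescape()` decodes by the HTML5 rules. The helper
 `is_served_as_xhtml()` is hypothetical and only mirrors the header rule
 stated above; it does not model the file-extension case.

```python
import html
from html.entities import name2codepoint  # HTML 4.01 definitions

html4_lang = chr(name2codepoint["lang"])  # what an HTML4 parser would produce
html5_lang = html.unescape("&lang;")      # html.unescape() follows HTML5 rules

print(f"HTML4: U+{ord(html4_lang):04X}, HTML5: U+{ord(html5_lang):04X}")


def is_served_as_xhtml(content_type_header):
    """Report whether an HTTP Content-Type value selects the XHTML media type."""
    # Drop parameters such as "; charset=UTF-8" before comparing.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type == "application/xhtml+xml"


print(is_served_as_xhtml("text/html; charset=UTF-8"))
print(is_served_as_xhtml("application/xhtml+xml; charset=UTF-8"))
```

 The same `&lang;` text yields two different code points depending on
 which table decodes it, which is why it works as a test for how a page is
 really being parsed.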

 Because of this behavior in browsers, WordPress sends content that it
 thinks is one thing but is received as another. Removing official support
 means that we can start to remove those places that purport to send HTML4
 or XHTML content, where that wrong assumption can lead to data corruption
 on top of needless syntax noise.

 WordPress still serves XML content in RSS feeds. This proposal does not
 recommend removing support for generating the XML feeds, but it may extend
 to the escaping and rendering of embedded HTML within those feeds: an RSS
 reader is unlikely to interpret embedded HTML as HTML4, and shouldn't; it
 should support embedded HTML5 as any web browser would.
 As an embedding, the content rendered into the feed remains separate from
 the surrounding RSS XML container.
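
 A rough illustration of that separation, using Python's
 `xml.etree.ElementTree` rather than WordPress's feed templates. The
 element names and the sample fragment are illustrative, and escaping the
 text node is only one way to embed HTML in XML; it is assumed here for
 simplicity.

```python
import xml.etree.ElementTree as ET

# HTML5 fragment as it might appear in post content (illustrative only).
embedded_html = "<p>&lang;x&rang;<br>HTML5 void element, no XHTML slash</p>"

item = ET.Element("item")
ET.SubElement(item, "description").text = embedded_html

# XML escaping is applied at the container level when serializing.
feed_xml = ET.tostring(item, encoding="unicode")
print(feed_xml)

# A feed reader unwraps the XML layer and gets the HTML back unchanged.
recovered = ET.fromstring(feed_xml).find("description").text
print(recovered == embedded_html)
```

 The XML layer only needs to escape and unescape the payload. Whether the
 payload is written as HTML4, XHTML or HTML5 is a separate decision, which
 is the sense in which the embedding stays independent of the container.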

 == Action plan

 Removing support for HTML4 and XHTML doesn't require any immediate action
 because HTML5 parsers compliantly parse HTML4 and XHTML up to their
 conflicting rules, such as with the `&lang;` named character reference.
 Since WordPress is already "broken" in this sense today, removing support
 does not imply that these are new bugs; rather it acknowledges that we
 missed updating WordPress once HTML4 and XHTML properly disappeared.

 In future work it opens up opportunities to modernize WordPress:
  - we don't need to handle complicated corner cases where pre-HTML5
 renderers require special cases.
  - we can remove code meant for backwards compatibility which no longer
 provides that support.
  - we can update Core functions such as `wp_kses_named_entities()` to
 prevent them from corrupting data based on inaccurate parsing rules from
 the past.
  - we can define a body of support and scope for what WordPress will and
 won't attempt to clean up. Functions like `force_balance_tags()` and the
 encoding functions attempt to normalize and sanitize HTML, but just as
 often they further break that HTML, when passing it through to the browser
 unchanged would have had a deterministic and safe resolution.

 The HTML API is providing WordPress the ability to have a smarter Core
 HTML system that won't be confused by rare or unexpected inputs and leans
 heavily on a spec-compliant "garbage-in garbage-out" approach. This
 dramatically simplifies HTML processing code without opening unsafe
 avenues; this is because HTML5 defines how to handle abnormal inputs.
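
 As a small illustration of "HTML5 defines how to handle abnormal inputs",
 consider character references. The sketch below uses Python's
 `html.unescape()`, which follows the HTML5 rules for valid and invalid
 references; it is not the HTML API, only a convenient way to see that
 each malformed input has exactly one specified outcome.

```python
import html

# Malformed or unusual references and what HTML5 says to do with them.
samples = {
    "&#x80;": "C1 control code, remapped as Windows-1252",
    "&#0;": "NULL, replaced with U+FFFD",
    "&#xD800;": "surrogate, replaced with U+FFFD",
    "&notit;": "no such name, but the legacy prefix &not still decodes",
    "&bogus;": "unknown name, left as written",
}

for source, note in samples.items():
    decoded = html.unescape(source)
    print(f"{source!r:12} -> {decoded!r:12} ({note})")
```

 None of these inputs needs to be repaired before it is sent: a compliant
 parser resolves every one of them the same way every time.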

 cc: @westonruter who ran a query and found at most two sites among
 millions that might be serving XHTML content through the inclusion of
 proper HTTP headers.

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/59883>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

