[wp-trac] [WordPress Trac] #57575: Editor: Introduce HTML Tag Processor

Sat Jan 28 01:49:17 UTC 2023

#57575: Editor: Introduce HTML Tag Processor
-------------------------+--------------------
 Reporter:  azaozz       |      Owner:  (none)
     Type:  enhancement  |     Status:  new
 Priority:  normal       |  Milestone:  6.2
Component:  Editor       |    Version:
 Severity:  normal       |   Keywords:
  Focuses:               |
-------------------------+--------------------
 This pulls in the HTML Tag Processor from the Gutenberg repository (as of
 [https://github.com/WordPress/gutenberg/commit/98ce5c53c9492764aeff88852a4073fb32e784ac
 WordPress/gutenberg at 98ce5c5]). The Tag Processor aims to be an
 HTML5-spec-compliant parser that makes it possible, in PHP, to find
 specific HTML tags and then add, remove, or update attributes on those
 tags. It provides a safe and reliable way to modify the attributes of HTML
 tags.

 {{{
 // Add missing `rel` attribute to links.
 $p = new WP_HTML_Tag_Processor( $block_content );
 if ( $p->next_tag( 'A' ) && empty( $p->get_attribute( 'rel' ) ) ) {
     $p->set_attribute( 'rel', 'noopener nofollow' );
 }
 return $p->get_updated_html();
 }}}

 Introduced originally in
 [https://github.com/WordPress/gutenberg/pull/42485
 WordPress/gutenberg#42485] and developed within the Gutenberg repository,
 this HTML parsing system was built to address a persistent need (properly
 modifying HTML tag attributes). It was motivated by a series of block
 editor defects that stemmed from mismatches between actual HTML code and
 the expectations about HTML input built into existing naive
 string-search-based solutions.

 The Tag Processor is intended to operate fast enough to avoid being an
 obstacle during page render while using as little memory as possible.
 It has practically zero memory overhead: it only allocates memory as
 changes to the input HTML document are enqueued, and it releases that
 memory when flushing those changes to the document, moving on to find the
 next tag, or flushing its entire output via `get_updated_html()`.

 Care has been taken to ensure that the Tag Processor will not be
 confused by unexpected or non-normative HTML input, including issues
 arising from quoting, from different syntax rules within `<title>`,
 `<textarea>`, and `<script>` tags, from the appearance of rare but
 legitimate comment and XML-like regions, and from a variety of syntax
 abnormalities such as unbalanced tags, incomplete syntax, and overlapping
 tags.

 The Tag Processor is constrained to parsing an HTML document as a stream
 of tokens. It will not build an HTML tree or generate a DOM representation
 of a document. It is designed to start at the beginning of an HTML
 document and linearly scan through it, potentially modifying that document
 as it scans. It has no access to the markup inside or around tags and it
 has no ability to determine which tag openers and tag closers belong to
 each other, or determine the nesting depth of a given tag.
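
 As a rough illustration of this linear scan, the following sketch visits
 every `IMG` tag in document order. It assumes it runs inside WordPress with
 `WP_HTML_Tag_Processor` loaded; the sample markup and the `data-seen`
 attribute are made up for the example.

 {{{
 // Sketch only: assumes WordPress with WP_HTML_Tag_Processor loaded.
 // The markup and the `data-seen` attribute are hypothetical.
 $html = '<div><img class="hero" src="a.png"><p>Text</p><img src="b.png"></div>';
 $p    = new WP_HTML_Tag_Processor( $html );

 // Each call resumes scanning from the current position.
 while ( $p->next_tag( array( 'tag_name' => 'IMG' ) ) ) {
     // Only the matched tag is visible here: its name and its attributes.
     // Nothing is known about the enclosing <div> or the nesting depth.
     $p->set_attribute( 'data-seen', 'true' );
 }
 $html = $p->get_updated_html();
 }}}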

 It includes a primitive bookmarking system to remember tags it has
 previously visited. These bookmarks refer to specific tags, not to string
 offsets, and continue to point to the same place in the document as edits
 are applied. By asking the Tag Processor to seek to a given bookmark, it's
 possible to back up and reprocess content that has already been
 traversed.
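
 A minimal sketch of how bookmarks might be used, assuming a loaded
 WordPress environment: it marks a `UL` opener, scans ahead to count later
 `IMG` tags, then seeks back to annotate the list. The markup, the bookmark
 name `list-start`, and the `data-image-count` attribute are illustrative
 only.

 {{{
 // Sketch only: assumes WordPress with WP_HTML_Tag_Processor loaded.
 // The markup, bookmark name, and `data-image-count` are hypothetical.
 $html = '<ul><li><img src="a.png"></li><li><img src="b.png"></li></ul>';
 $p    = new WP_HTML_Tag_Processor( $html );

 if ( $p->next_tag( 'UL' ) && $p->set_bookmark( 'list-start' ) ) {
     // Scan ahead, counting images that appear after the <ul> opener.
     $image_count = 0;
     while ( $p->next_tag( 'IMG' ) ) {
         $image_count++;
     }

     // Return to the bookmarked <ul> and record what was found.
     if ( $p->seek( 'list-start' ) ) {
         $p->set_attribute( 'data-image-count', (string) $image_count );
     }
     $p->release_bookmark( 'list-start' );
 }
 $html = $p->get_updated_html();
 }}}

 Because the processor has no notion of nesting, the count covers every
 `IMG` after the opener, not only those inside the list.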

 Attribute values are sanitized with `esc_attr()` and rendered as double-
 quoted attributes. On read they are unescaped and unquoted. Authors
 wishing to rely on the Tag Processor therefore are free to pass around
 data as normal strings.
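
 The round trip could look roughly like the sketch below, which assumes a
 loaded WordPress environment. The markup and the title text are made up;
 the exact escaped form in the output is left to `esc_attr()` and is not
 shown here.

 {{{
 // Sketch only: assumes WordPress with WP_HTML_Tag_Processor loaded.
 // The markup and the title text are hypothetical.
 $p = new WP_HTML_Tag_Processor( '<a href="/about">About</a>' );

 if ( $p->next_tag( 'A' ) ) {
     // Pass the plain string; escaping and quoting happen on output.
     $p->set_attribute( 'title', 'Tom & "Jerry"' );
 }
 $html = $p->get_updated_html();

 // Reading the updated markup yields the unescaped, unquoted value.
 $reader = new WP_HTML_Tag_Processor( $html );
 if ( $reader->next_tag( 'A' ) ) {
     $title = $reader->get_attribute( 'title' );
 }
 }}}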

 Convenience methods for adding and removing CSS class names exist so that
 callers do not need to parse the `class` attribute themselves.

 {{{
 // Update heading block class names
 $p = new WP_HTML_Tag_Processor( $html );
 while ( $p->next_tag() ) {
     switch ( $p->get_tag() ) {
         case 'H1':
         case 'H2':
         case 'H3':
         case 'H4':
         case 'H5':
         case 'H6':
             $p->remove_class( 'wp-heading' );
             $p->add_class( 'wp-block-heading' );
             break;
     }
 }
 return $p->get_updated_html();
 }}}

 The Tag Processor is intended to be a reliable low-level library for
 traversing HTML documents, and higher-level APIs are to be built upon it.
 In the immediate term, it is meant to replace HTML modification in Core
 Gutenberg blocks that currently relies on RegExp patterns and simpler
 string replacements.

 See the following for examples of such replacement:

 - [https://github.com/WordPress/gutenberg/commit/13157844ac9aa4307a7a1e3abc54d0f7b0c333cd
 WordPress/gutenberg at 1315784]
 - https://github.com/WordPress/gutenberg/pull/45469/files#diff
 - [https://github.com/WordPress/gutenberg/pull/46625 Update/block support
 settings use tag processor gutenberg#46625]

 By @dmsnell, co-authored-by: @zieladam, @bernhard-reiter, @gziolo.

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/57575>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform