[wp-trac] [WordPress Trac] #60170: HTML API: Scan every token in an HTML document.

WordPress Trac noreply at wordpress.org
Mon Jan 1 03:39:21 UTC 2024


#60170: HTML API: Scan every token in an HTML document.
-------------------------+-----------------------------
 Reporter:  dmsnell      |      Owner:  (none)
     Type:  enhancement  |     Status:  new
 Priority:  normal       |  Milestone:  Awaiting Review
Component:  HTML API     |    Version:  trunk
 Severity:  normal       |   Keywords:  has-patch
  Focuses:               |
-------------------------+-----------------------------
 The Tag Processor already visits every syntax element in an HTML document,
 but it only exposes tags for inspection and modification. Certain features
 require visiting every kind of syntax token. This patch adds a new system
 to the HTML Tag Processor for doing that.

 Adding full token-scanning opens up new opportunities, including but not
 limited to the following list:

  - A version of `wp_truncate_html()` that parses only as much HTML as
 necessary and that preserves HTML tags without counting them toward the
 length of the string.
 [https://gist.github.com/dmsnell/11f7165f5c7cd3ca69e4b9d751ff093e gist]
  - Improve the reliability (and potentially the performance) of existing
 functions like `wp_strip_all_tags()`.
  - Introduce new filtering pipelines that coalesce multiple filters into a
 single pass operating only on text nodes, avoiding repeated passes of
 multiple buggy regular-expression-based transforms.
  - Rendering HTML as text.
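
 As a rough illustration of the last item, rendering HTML as text could
 reduce to concatenating the text tokens, using the `get_token_type()` and
 `get_modifiable_text()` methods this patch introduces. This is only a
 sketch: it assumes WordPress is loaded with the patch applied, and that
 the method that advances to the next token is named `next_token()`, which
 this ticket does not state. `example_html_to_text()` is a hypothetical
 helper, not part of the patch.

```php
<?php
// Sketch only: assumes WordPress is loaded with this patch applied.
// next_token() is an assumed method name; example_html_to_text() is hypothetical.
function example_html_to_text( $html ) {
	$processor = new WP_HTML_Tag_Processor( $html );
	$text      = '';

	// Visit every token, keeping only the modifiable text of text nodes.
	// Tags, comments, and SCRIPT/STYLE contents are skipped.
	while ( $processor->next_token() ) {
		if ( '#text' === $processor->get_token_type() ) {
			$text .= $processor->get_modifiable_text();
		}
	}

	return $text;
}
```

 A real replacement for `wp_strip_all_tags()` would likely also need to
 decide how to handle whitespace between block-level elements, which this
 sketch ignores.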

 This patch introduces the concept of **modifiable text**: plain text
 inside an HTML document that cannot contain other tags or syntax tokens.
 `#text` nodes are modifiable text, as are the contents of a SCRIPT or STYLE
 element and the text inside an HTML comment. Modifiable text can be changed
 without altering the structure of the HTML document.

 This patch introduces new methods on the Tag Processor:
  - `get_token_type()` reports what kind of token is currently matched,
 e.g. a tag or a comment.
  - `get_token_name()` reports roughly the DOM name that a browser would
 assign to the node.
  - `get_modifiable_text()` returns the modifiable text for a node.
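
 The three methods above might be combined along the following lines to
 list every token in a document. This is a sketch under assumptions, not
 code from the patch: it requires WordPress to be loaded with the patch
 applied, `next_token()` is assumed to be the name of the method that
 advances the processor, and `example_list_tokens()` is a hypothetical
 helper.

```php
<?php
// Sketch only: assumes WordPress is loaded with this patch applied.
// next_token() is an assumed method name; example_list_tokens() is hypothetical.
function example_list_tokens( $html ) {
	$processor = new WP_HTML_Tag_Processor( $html );
	$tokens    = array();

	// Record the type, name, and modifiable text of each token in order.
	while ( $processor->next_token() ) {
		$tokens[] = array(
			'type' => $processor->get_token_type(),
			'name' => $processor->get_token_name(),
			'text' => $processor->get_modifiable_text(),
		);
	}

	return $tokens;
}

// Print one line per token for a small document.
foreach ( example_list_tokens( '<p>Hello<!-- note --></p>' ) as $token ) {
	printf( "%s | %s | %s\n", $token['type'], $token['name'], $token['text'] );
}
```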

 This patch does not yet provide a way to change the modifiable text. See
 the GitHub PR for a more detailed description of the new modifiable text
 concept.

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/60170>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

