[wp-trac] [WordPress Trac] #60170: HTML API: Scan every token in an HTML document.
WordPress Trac
noreply at wordpress.org
Mon Jan 1 03:39:21 UTC 2024
#60170: HTML API: Scan every token in an HTML document.
-------------------------+-----------------------------
Reporter: dmsnell | Owner: (none)
Type: enhancement | Status: new
Priority: normal | Milestone: Awaiting Review
Component: HTML API | Version: trunk
Severity: normal | Keywords: has-patch
Focuses: |
-------------------------+-----------------------------
The Tag Processor currently visits every syntax element in an HTML
document, but it only exposes tags for inspection and modification. To
build certain features, it needs to support visiting every kind of syntax
token. In this patch the HTML Tag Processor exposes a new system for doing
that.
Adding full token-scanning opens up new opportunities, including but not
limited to the following list:
- A version of `wp_truncate_html()` that only parses as much HTML as is
necessary and which provides the ability to preserve the HTML tags while
ignoring their contribution to the HTML string length.
[https://gist.github.com/dmsnell/11f7165f5c7cd3ca69e4b9d751ff093e gist]
- Improving the reliability (and potentially the performance) of existing
functions like `wp_strip_all_tags()`.
- Introducing new filtering pipelines that coalesce multiple filters into a
single pass operating only on text nodes, avoiding repeated iterations
of multiple buggy regular-expression-based transforms.
- Rendering HTML as text.
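As a rough sketch of the last item, rendering HTML as text could amount to concatenating only the text tokens the scanner reports. The example assumes WordPress is loaded with this patch applied and that the scanner is advanced with a `next_token()` method, which this description does not name. `sketch_html_to_text()` is a hypothetical helper, not part of the patch.

```php
<?php
// Sketch only: assumes WordPress (with this patch) is already loaded,
// for example when the file is run through `wp eval-file`.

/**
 * Hypothetical helper: keep the text tokens of a document, drop everything else.
 */
function sketch_html_to_text( string $html ): string {
	$processor = new WP_HTML_Tag_Processor( $html );
	$text      = '';

	// Assumption: next_token() advances to the next syntax token of any kind.
	while ( $processor->next_token() ) {
		// Tags, comments, and SCRIPT/STYLE contents are not `#text` tokens,
		// so only ordinary text reaches the output.
		if ( '#text' === $processor->get_token_type() ) {
			$text .= $processor->get_modifiable_text();
		}
	}

	return $text;
}

echo sketch_html_to_text( '<p>Hello <b>world</b><script>ignored();</script></p>' );
```

A single pass like this never builds a DOM tree and never runs a regular expression over the markup, which is the property the items above rely on.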
This patch introduces the concept of **modifiable text**, which represents
plain text inside an HTML document that cannot contain other tags or
syntax tokens. `#text` nodes are modifiable text, as are the contents of a
SCRIPT or STYLE element and the text inside an HTML comment. Modifiable
text can be modified without changing the structure of the HTML document.
This patch introduces new methods on the Tag Processor:
- `get_token_type()` reports what kind of token is currently matched,
e.g. a tag or a comment.
- `get_token_name()` roughly reports the DOM name that would be assigned
to the node in a browser.
- `get_modifiable_text()` returns the modifiable text for a node.
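The three methods above might be combined as in the following sketch, which walks a small document and records what each method reports for every token. It assumes WordPress is loaded with this patch applied and that tokens are advanced with a `next_token()` method, which this description does not name. `sketch_list_tokens()` is a hypothetical helper, not part of the patch.

```php
<?php
// Sketch only: assumes WordPress (with this patch) is already loaded,
// for example when the file is run through `wp eval-file`.

/**
 * Hypothetical helper: collect the type, name, and modifiable text of every token.
 */
function sketch_list_tokens( string $html ): array {
	$processor = new WP_HTML_Tag_Processor( $html );
	$tokens    = array();

	// Assumption: next_token() advances to the next syntax token of any kind.
	while ( $processor->next_token() ) {
		$tokens[] = array(
			'type' => $processor->get_token_type(),      // e.g. a tag or a comment
			'name' => $processor->get_token_name(),      // roughly the DOM node name
			'text' => $processor->get_modifiable_text(), // modifiable text, if any
		);
	}

	return $tokens;
}

// Print one tab-separated line per token.
foreach ( sketch_list_tokens( '<p>Hi <!-- note --><b>there</b></p>' ) as $token ) {
	printf( "%s\t%s\t%s\n", $token['type'], $token['name'], $token['text'] );
}
```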
This patch does not yet provide a way to change the modifiable text. See
the GitHub PR for more elaborate descriptions of the new modifiable text
concept.
--
Ticket URL: <https://core.trac.wordpress.org/ticket/60170>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform