[wp-trac] [WordPress Trac] #60698: Token Map: Introduce an efficient lookup and translation class for string mappings.
WordPress Trac
noreply at wordpress.org
Fri May 24 23:42:58 UTC 2024
#60698: Token Map: Introduce an efficient lookup and translation class for string
mappings.
----------------------------------------------------+----------------------
Reporter: dmsnell | Owner: dmsnell
Type: feature request | Status: closed
Priority: normal | Milestone: 6.6
Component: General | Version: trunk
Severity: normal | Resolution: fixed
Keywords: has-patch needs-unit-tests 2nd-opinion | Focuses:
----------------------------------------------------+----------------------
Description changed by dmsnell:
Old description:
> Motivated by the need to properly transform HTML named character
> references (like `&`) I found the need for a new semantic class which
> can efficiently perform search and replacement of a set of static tokens.
> Existing patterns in the codebase are not sufficient for the HTML need,
> and I suspect there are other use-cases where this class would help.
>
> It may be helpful to review the dev note draft before reviewing this
> work:
> https://make.wordpress.org/core/?p=113042&preview=1&_ppp=452181011b
>
> == How does WordPress typically perform this kind of substitution?
>
> === An associative array with a generated regular expression callback.
>
> It's common to see a pattern like this, where the lookup is stored as an
> array and a regular expression is built from it.
>
> {{{#!php
> <?php
> $smilies = array(
> ':(' => "\xf0\x9f\x99\x81",
> ':)' => "\xf0\x9f\x99\x82",
> ':?' => "\xf0\x9f\x98\x95",
> ':D' => "\xf0\x9f\x98\x80",
> );
>
> preg_replace_callback(
> '/' . implode( '|', array_keys( $smilies ) ) . '/',
> function ( $smily ) use ( $smilies ) { return $smilies[ $smily[0]
> ]; }
> );
> }}}
>
> This method constructs a large regular expression pattern on every
> request and it's not always clear what performance implications there
> might be with the long list of alternation groups. These can lead to some
> cataclysmic runtimes and the code surrounding these to setup the regular
> expression pattern and replacments tend to appear somewhat exotic,
> masking the intent of the original goal of the code.
>
> === Calls to `str_replace()` with two arrays.
>
> {{{#!php
> <?php
> $chars = array(
> 'ª' => 'a',
> 'º' => 'o',
> 'À' => 'A',
> 'Á' => 'A',
> 'Â' => 'A',
> 'Ã' => 'A',
> 'Ä' => 'A',
> );
>
> str_replace( array_keys( $chars ), array_values( $chars ), $text );
> }}}
>
> This method is clear in the code, but the runtime can be very inefficient
> due to the fact that it's required to iterate over all of the array keys
> until a string match is found. Like the regular expression, performance
> can degrade quickly when there are hundreds or thousands of replacements.
>
> === Scanning with array lookup.
>
> Sometimes code is crawling through a string and testing for set
> membership using an array.
>
> {{{#!php
> <?php
> preg_replace(
> '~&([^;]+;)~',
> function ( $name ) use ( $html_entities ) {
> $allowed = in_array( $name[0], $html_entities, true ) );
> return $allowed ? $name : "&{$name[1]}";
> }
> );
> }}}
>
> This can work if the tokens follow a searchable pattern, but it's hard to
> apply complicated rules for the token pattern with the regular expression
> syntax, and this method also suffers from slow set-membership checking in
> the lookup array. For small arrays it's fine, but not for large ones,
> especially if the document contains a high number of the input tokens.
>
> ----
>
> All of these existing methods work best when it's clear that a given
> string contains one of the tokens, but what if it's not certain? The
> performance characteristics of the array-based approaches are worst when
> the document is missing the input tokens. These approaches are memory
> heavy and inefficient computationally.
>
> ----
>
> == A different approach.
>
> I'm proposing a new class whose semantic is a relatively static mapping
> of terms or tokens to replacement strings. This class provides a number
> of purpose-specific methods supporting token-replacement operations,
> including:
>
> - Fast set-membership testing, to determine if a given string is one of
> the tokens.
> - Fast mapping from a given token to its replacement value.
> - Determining if one of the tokens exists at a specific offset, and if
> so, what it is and what its mapping is.
>
> {{{#!php
> <?php
> $named_character_references = WP_Token_Map::from_array( [ 'notin' => '∉',
> 'notin;' => '∉', 'amp;' => '&', … ] );
>
> $matched_string_snippet = '¬in';
> if ( $named_character_references->contains( substr(
> $matched_string_snippet, 1 ) ) ) {
> // "¬in" is a valid token
> }
>
> // Replace all named character references with their replacements.
> while ( true ) {
> $was_at = $at;
> $at = strpos( $text, '&', $at );
> if ( false === $at ) {
> $output .= substr( $text, $was_at )
> break;
> }
>
> $name_length = 0;
> $replacement = $named_character_reference->read_token( $text,
> $at, $name_length );
> if ( false !== $replacement ) {
> $output .= substr( $text, $was_at, $at - $was_at );
> $output .= $replacement;
> $at += $name_length;
> continue;
> }
>
> // No named character reference was found, continue searching.
> ++$at;
> }
> }}}
>
> In this example the leading `&` has been pruned from the lookup table to
> save space and time, but it very well could be added into the table.
>
> Because this class introduces specific semantic meaning, that is, looking
> up and mapping static string tokens to static string replacements, we can
> employ a number of optimizations to reduce the overall memory footprint
> and also to optimize the runtime. Such optimizations have been
> incorporated into the linked PR.
>
> ----
>
> In [https://github.com/WordPress/wordpress-develop/pull/6387 #6387] I
> have built a spec-compliant HTML text decoder which utilizes this token
> map. The performance of the new decoder is approximately 20% slower than
> calling `html_entity_decode()` directly, except it properly decodes what
> PHP can't. In fact, the performance bottleneck in that work is not even
> in the token map, but in converting a sequence of digits into a UTF-8
> string from the given code point.
>
> My proposal is adding a new class `WP_Token_Map` providing at least two
> methods for normal use:
>
> - `contains( $token )` returns whether the passed string is in the set.
> - `read_token( $text, $offset = 0, $skip_bytes )` indicates if the
> character sequence starting at the given offset in the passed string
> forms a token in the set, and if so, returns the replacement for that
> token. It also sets `&$skip_bytes` to the length of the token so that
> calling code .
>
> It also provides utility functions for pre-computing these classes, as
> they are designed for relatively-static cases where the actual code is
> intended to be generated dynamically, but stay static over time. For
> example, HTML5 defines the set of named character references and
> indicates that the list //shall not// change or be expanded.
> [https://html.spec.whatwg.org/#named-character-references-table HTML5
> spec]. Precomputing can save on the startup-up cost of building the
> optimized lookup tables.
>
> - `static::from_array( array $mappings )` generates a new token map from
> the given array of whose keys are tokens and whose values are the
> replacements.
> - `to_array()` dumps the set of mapping into an array suitable for
> passing back into `from_array()`.
> - `static::from_precomputed_table( ...$table )` instantiates a token set
> from a precomputed table, skipping the computation for building the table
> and sorting the tokens.
> - `precomputed_php_source_table()` generates PHP source code which can
> be loaded with the previous static method for maintenance of the core
> static token sets.
>
> == Other potential uses
>
> - Converting smilies like `:)`
> - Converting emoji sequences like `:happy:`
> - Determining if a given verb/action is allowed in an API call.
New description:
Motivated by the need to properly transform HTML named character
references (like `&`) I found the need for a new semantic class which
can efficiently perform search and replacement of a set of static tokens.
Existing patterns in the codebase are not sufficient for the HTML need,
and I suspect there are other use-cases where this class would help.
It may be helpful to review the dev note draft before reviewing this work:
https://make.wordpress.org/core/?p=113042&preview=1&_ppp=fbe827d6b7
== How does WordPress typically perform this kind of substitution?
=== An associative array with a generated regular expression callback.
It's common to see a pattern like this, where the lookup is stored as an
array and a regular expression is built from it.
{{{#!php
<?php
$smilies = array(
':(' => "\xf0\x9f\x99\x81",
':)' => "\xf0\x9f\x99\x82",
':?' => "\xf0\x9f\x98\x95",
':D' => "\xf0\x9f\x98\x80",
);
preg_replace_callback(
'/' . implode( '|', array_keys( $smilies ) ) . '/',
function ( $smily ) use ( $smilies ) { return $smilies[ $smily[0]
]; }
);
}}}
This method constructs a large regular expression pattern on every request
and it's not always clear what performance implications there might be
with the long list of alternation groups. These can lead to some
cataclysmic runtimes and the code surrounding these to setup the regular
expression pattern and replacments tend to appear somewhat exotic, masking
the intent of the original goal of the code.
=== Calls to `str_replace()` with two arrays.
{{{#!php
<?php
$chars = array(
'ª' => 'a',
'º' => 'o',
'À' => 'A',
'Á' => 'A',
'Â' => 'A',
'Ã' => 'A',
'Ä' => 'A',
);
str_replace( array_keys( $chars ), array_values( $chars ), $text );
}}}
This method is clear in the code, but the runtime can be very inefficient
due to the fact that it's required to iterate over all of the array keys
until a string match is found. Like the regular expression, performance
can degrade quickly when there are hundreds or thousands of replacements.
=== Scanning with array lookup.
Sometimes code is crawling through a string and testing for set membership
using an array.
{{{#!php
<?php
preg_replace(
'~&([^;]+;)~',
function ( $name ) use ( $html_entities ) {
$allowed = in_array( $name[0], $html_entities, true ) );
return $allowed ? $name : "&{$name[1]}";
}
);
}}}
This can work if the tokens follow a searchable pattern, but it's hard to
apply complicated rules for the token pattern with the regular expression
syntax, and this method also suffers from slow set-membership checking in
the lookup array. For small arrays it's fine, but not for large ones,
especially if the document contains a high number of the input tokens.
----
All of these existing methods work best when it's clear that a given
string contains one of the tokens, but what if it's not certain? The
performance characteristics of the array-based approaches are worst when
the document is missing the input tokens. These approaches are memory
heavy and inefficient computationally.
----
== A different approach.
I'm proposing a new class whose semantic is a relatively static mapping of
terms or tokens to replacement strings. This class provides a number of
purpose-specific methods supporting token-replacement operations,
including:
- Fast set-membership testing, to determine if a given string is one of
the tokens.
- Fast mapping from a given token to its replacement value.
- Determining if one of the tokens exists at a specific offset, and if so,
what it is and what its mapping is.
{{{#!php
<?php
$named_character_references = WP_Token_Map::from_array( [ 'notin' => '∉',
'notin;' => '∉', 'amp;' => '&', … ] );
$matched_string_snippet = '¬in';
if ( $named_character_references->contains( substr(
$matched_string_snippet, 1 ) ) ) {
// "¬in" is a valid token
}
// Replace all named character references with their replacements.
while ( true ) {
$was_at = $at;
$at = strpos( $text, '&', $at );
if ( false === $at ) {
$output .= substr( $text, $was_at )
break;
}
$name_length = 0;
$replacement = $named_character_reference->read_token( $text, $at,
$name_length );
if ( false !== $replacement ) {
$output .= substr( $text, $was_at, $at - $was_at );
$output .= $replacement;
$at += $name_length;
continue;
}
// No named character reference was found, continue searching.
++$at;
}
}}}
In this example the leading `&` has been pruned from the lookup table to
save space and time, but it very well could be added into the table.
Because this class introduces specific semantic meaning, that is, looking
up and mapping static string tokens to static string replacements, we can
employ a number of optimizations to reduce the overall memory footprint
and also to optimize the runtime. Such optimizations have been
incorporated into the linked PR.
----
In [https://github.com/WordPress/wordpress-develop/pull/6387 #6387] I have
built a spec-compliant HTML text decoder which utilizes this token map.
The performance of the new decoder is approximately 20% slower than
calling `html_entity_decode()` directly, except it properly decodes what
PHP can't. In fact, the performance bottleneck in that work is not even in
the token map, but in converting a sequence of digits into a UTF-8 string
from the given code point.
My proposal is adding a new class `WP_Token_Map` providing at least two
methods for normal use:
- `contains( $token )` returns whether the passed string is in the set.
- `read_token( $text, $offset = 0, $skip_bytes )` indicates if the
character sequence starting at the given offset in the passed string forms
a token in the set, and if so, returns the replacement for that token. It
also sets `&$skip_bytes` to the length of the token so that calling code .
It also provides utility functions for pre-computing these classes, as
they are designed for relatively-static cases where the actual code is
intended to be generated dynamically, but stay static over time. For
example, HTML5 defines the set of named character references and indicates
that the list //shall not// change or be expanded.
[https://html.spec.whatwg.org/#named-character-references-table HTML5
spec]. Precomputing can save on the startup-up cost of building the
optimized lookup tables.
- `static::from_array( array $mappings )` generates a new token map from
the given array of whose keys are tokens and whose values are the
replacements.
- `to_array()` dumps the set of mapping into an array suitable for
passing back into `from_array()`.
- `static::from_precomputed_table( ...$table )` instantiates a token set
from a precomputed table, skipping the computation for building the table
and sorting the tokens.
- `precomputed_php_source_table()` generates PHP source code which can be
loaded with the previous static method for maintenance of the core static
token sets.
== Other potential uses
- Converting smilies like `:)`
- Converting emoji sequences like `:happy:`
- Determining if a given verb/action is allowed in an API call.
--
--
Ticket URL: <https://core.trac.wordpress.org/ticket/60698#comment:28>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform
More information about the wp-trac
mailing list