[wp-trac] [WordPress Trac] #60698: Token Map: Introduce an efficient lookup and translation class for string mappings. (was: Add optimized set lookup class.)

Thu May 9 02:10:56 UTC 2024

#60698: Token Map: Introduce an efficient lookup and translation class for string
mappings.
----------------------------------------+---------------------
 Reporter:  dmsnell                     |       Owner:  (none)
     Type:  feature request             |      Status:  new
 Priority:  normal                      |   Milestone:  6.6
Component:  General                     |     Version:  trunk
 Severity:  normal                      |  Resolution:
 Keywords:  has-patch needs-unit-tests  |     Focuses:
----------------------------------------+---------------------
Changes (by dmsnell):

 * keywords:   => has-patch needs-unit-tests
 * version:  6.5 => trunk
 * milestone:  Awaiting Review => 6.6

Old description:

> In the course of exploratory development in the HTML API there have been
> a few times where I wanted to test if a given string is in a set of
> statically-known strings, and a few times where I wanted to check if the
> next span of text represents an item in the set.
>
> For the first case, `in_array()` is a suitable method, but isn't always
> ideal when the test set is large.
>
> {{{#!php
> <?php
> if ( in_array( '&notin', $html5_named_character_references, true ) )
> }}}
>
> For the second case, `in_array()` isn't adequate, and a more complicated
> lookup is necessary.
>
> {{{#!php
> <?php
> foreach ( $html5_named_character_references as $name ) {
>         if ( 0 === substr_compare( $html, $name, $at, /* length */ null,
> /* case insensitive */ true ) ) {
>                 …
>                 return $name;
>         }
> }
> }}}
>
> This second example demonstrates some catastrophic lookup characteristics
> when it's not certain if the following input is any token from the set,
> let alone which one it might be. The at-hand code has to iterate the
> search domain and then compare every single possibility against the input
> string, bailing when one is found.
>
> While reviewing code in various places I've noticed a similar pattern and
> need, mostly being served by `in_array()` and a regex that splits apart
> an input string into a large array, allocating substrings for each array
> element, and then calling `in_array()` inside the regex callback (or when
> the results array is passed to another function). This is all memory
> heavy and inefficient in the runtime.
>

> ----
>

> I'd like to propose a new class whose semantic is a relatively static set
> of terms or tokens which provides a test for membership within the set,
> and what the next matching term or token is at a given offset in a
> string, if the next sequence of characters matches one.
>
> {{{#!php
> <?php
> $named_character_references = WP_Token_Set( [ '&notin', '∉',
> '&', … ] );
>
> if ( $named_character_references->contains( '&notin' ) ) {
>         …
> }
>
> while ( true ) {
>         $was_at = $at;
>         $at = strpos( $text, '&', $at );
>         if ( false === $at ) {
>                 $output .= substr( $text, $was_at )
>                 break;
>         }
>
>         $name = $named_character_reference->read_token( $text, $at );
>         if ( false !== $name ) {
>                 $output .= substr( $text, $was_at, $at - $was_at );
>                 $output .= $named_character_replacements[ $name ];
>                 $at     += strlen( $name );
>                 continue;
>         }
>
>         // No named character reference was found, continue searching.
>         ++$at;
> }
> }}}
>
> ----
>
> Further, because WordPress largely deals with large and relatively static
> token sets (named character references, allowable URL schemes, file
> types, loaded templates, etc…), it would be nice to be able to precompute
> the lookup tables if they are at all costly, as doing so on every PHP
> load is unnecessarily burdensome.
>
> A bonus feature would be a method to add and a method to remove terms.
>
> ----
>
> In [https://github.com/WordPress/wordpress-develop/pull/5373 #5373] I
> have proposed such a `WP_Token_Set` and used it in
> [https://github.com/WordPress/wordpress-develop/pull/5337 #5337] to
> create a spec-compliant, low-memory-overhead, and efficient replacement
> for `esc_attr()`.
>
> The replacement `esc_attr()` is able to more reliably parse attribute
> values than the existing code and it does so more efficiently, avoiding
> numerous memory allocations and lookups.

New description:

 Motivated by the need to properly transform HTML named character
 references (like `&`) I found the need for a new semantic class which
 can efficiently perform search and replacement of a set of static tokens.
 Existing patterns in the codebase are not sufficient for the HTML need,
 and I suspect there are other use-cases where this class would help.

 == How does WordPress typically perform this kind of substitution?

 === An associative array with a generated regular expression callback.

 It's common to see a pattern like this, where the lookup is stored as an
 array and a regular expression is built from it.

 {{{#!php
 <?php
 $smilies = array(
   ':(' => "\xf0\x9f\x99\x81",
   ':)' => "\xf0\x9f\x99\x82",
   ':?' => "\xf0\x9f\x98\x95",
   ':D' => "\xf0\x9f\x98\x80",
 );

 preg_replace_callback(
         '/' . implode( '|', array_keys( $smilies ) ) . '/',
         function ( $smily ) use ( $smilies ) { return $smilies[ $smily[0]
 ]; }
 );
 }}}

 This method constructs a large regular expression pattern on every request
 and it's not always clear what performance implications there might be
 with the long list of alternation groups. These can lead to some
 cataclysmic runtimes and the code surrounding these to setup the regular
 expression pattern and replacments tend to appear somewhat exotic, masking
 the intent of the original goal of the code.

 === Calls to `str_replace()` with two arrays.

 {{{#!php
 <?php
 $chars = array(
         'ª' => 'a',
         'º' => 'o',
         'À' => 'A',
         'Á' => 'A',
         'Â' => 'A',
         'Ã' => 'A',
         'Ä' => 'A',
 );

 str_replace( array_keys( $chars ), array_values( $chars ), $text );
 }}}

 This method is clear in the code, but the runtime can be very inefficient
 due to the fact that it's required to iterate over all of the array keys
 until a string match is found. Like the regular expression, performance
 can degrade quickly when there are hundreds or thousands of replacements.

 === Scanning with array lookup.

 Sometimes code is crawling through a string and testing for set membership
 using an array.

 {{{#!php
 <?php
 preg_replace(
         '~&([^;]+;)~',
         function ( $name ) use ( $html_entities ) {
                 $allowed = in_array( $name[0], $html_entities, true ) );
                 return $allowed ? $name : "&{$name[1]}";
         }
 );
 }}}

 This can work if the tokens follow a searchable pattern, but it's hard to
 apply complicated rules for the token pattern with the regular expression
 syntax, and this method also suffers from slow set-membership checking in
 the lookup array. For small arrays it's fine, but  not for large ones,
 especially if the document contains a high number of the input tokens.

 ----

 All of these existing methods work best when it's clear that a given
 string contains one of the tokens, but what if it's not certain? The
 performance characteristics of the array-based approaches are worst when
 the document is missing the input tokens. These approaches are memory
 heavy and inefficient computationally.

 ----

 == A different approach.

 I'm proposing a new class whose semantic is a relatively static mapping of
 terms or tokens to replacement strings. This class provides a number of
 purpose-specific methods supporting token-replacement operations,
 including:

 - Fast set-membership testing, to determine if a given string is one of
 the tokens.
 - Fast mapping from a given token to its replacement value.
 - Determining if one of the tokens exists at a specific offset, and if so,
 what it is and what its mapping is.

 {{{#!php
 <?php
 $named_character_references = WP_Token_Map::from_array( [ 'notin' => '∉',
 'notin;' => '∉', 'amp;' => '&', … ] );

 $matched_string_snippet = '&notin';
 if ( $named_character_references->contains( substr(
 $matched_string_snippet, 1 ) ) ) {
         // "&notin" is a valid token
 }

 // Replace all named character references with their replacements.
 while ( true ) {
         $was_at = $at;
         $at = strpos( $text, '&', $at );
         if ( false === $at ) {
                 $output .= substr( $text, $was_at )
                 break;
         }

         $name_length = 0;
         $replacement = $named_character_reference->read_token( $text, $at,
 $name_length );
         if ( false !== $replacement ) {
                 $output .= substr( $text, $was_at, $at - $was_at );
                 $output .= $replacement;
                 $at     += $name_length;
                 continue;
         }

         // No named character reference was found, continue searching.
         ++$at;
 }
 }}}

 In this example the leading `&` has been pruned from the lookup table to
 save space and time, but it very well could be added into the table.

 Because this class introduces specific semantic meaning, that is, looking
 up and mapping static string tokens to static string replacements, we can
 employ a number of optimizations to reduce the overall memory footprint
 and also to optimize the runtime. Such optimizations have been
 incorporated into the linked PR.

 ----

 In [https://github.com/WordPress/wordpress-develop/pull/6387 #6387] I have
 built a spec-compliant HTML text decoder which utilizes this token map.
 The performance of the new decoder is approximately 20% slower than
 calling `html_entity_decode()` directly, except it properly decodes what
 PHP can't. In fact, the performance bottleneck in that work is not even in
 the token map, but in converting a sequence of digits into a UTF-8 string
 from the given code point.

 == Other potential uses

  - Converting smilies like `:)`
  - Converting emoji sequences like `:happy:`
  - Determining if a given verb/action is allowed in an API call.

--

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/60698#comment:1>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform