[wp-trac] [WordPress Trac] #60698: Token Map: Introduce an efficient lookup and translation class for string mappings.

Fri May 24 23:42:58 UTC 2024

#60698: Token Map: Introduce an efficient lookup and translation class for string
mappings.
----------------------------------------------------+----------------------
 Reporter:  dmsnell                                 |       Owner:  dmsnell
     Type:  feature request                         |      Status:  closed
 Priority:  normal                                  |   Milestone:  6.6
Component:  General                                 |     Version:  trunk
 Severity:  normal                                  |  Resolution:  fixed
 Keywords:  has-patch needs-unit-tests 2nd-opinion  |     Focuses:
----------------------------------------------------+----------------------
Description changed by dmsnell:

Old description:

> Motivated by the need to properly transform HTML named character
> references (like `&`) I found the need for a new semantic class which
> can efficiently perform search and replacement of a set of static tokens.
> Existing patterns in the codebase are not sufficient for the HTML need,
> and I suspect there are other use-cases where this class would help.
>
> It may be helpful to review the dev note draft before reviewing this
> work:
> https://make.wordpress.org/core/?p=113042&preview=1&_ppp=452181011b
>
> == How does WordPress typically perform this kind of substitution?
>
> === An associative array with a generated regular expression callback.
>
> It's common to see a pattern like this, where the lookup is stored as an
> array and a regular expression is built from it.
>
> {{{#!php
> <?php
> $smilies = array(
>   ':(' => "\xf0\x9f\x99\x81",
>   ':)' => "\xf0\x9f\x99\x82",
>   ':?' => "\xf0\x9f\x98\x95",
>   ':D' => "\xf0\x9f\x98\x80",
> );
>
> preg_replace_callback(
>         '/' . implode( '|', array_keys( $smilies ) ) . '/',
>         function ( $smily ) use ( $smilies ) { return $smilies[ $smily[0]
> ]; }
> );
> }}}
>
> This method constructs a large regular expression pattern on every
> request and it's not always clear what performance implications there
> might be with the long list of alternation groups. These can lead to some
> cataclysmic runtimes and the code surrounding these to setup the regular
> expression pattern and replacments tend to appear somewhat exotic,
> masking the intent of the original goal of the code.
>
> === Calls to `str_replace()` with two arrays.
>
> {{{#!php
> <?php
> $chars = array(
>         'ª' => 'a',
>         'º' => 'o',
>         'À' => 'A',
>         'Á' => 'A',
>         'Â' => 'A',
>         'Ã' => 'A',
>         'Ä' => 'A',
> );
>
> str_replace( array_keys( $chars ), array_values( $chars ), $text );
> }}}
>
> This method is clear in the code, but the runtime can be very inefficient
> due to the fact that it's required to iterate over all of the array keys
> until a string match is found. Like the regular expression, performance
> can degrade quickly when there are hundreds or thousands of replacements.
>
> === Scanning with array lookup.
>
> Sometimes code is crawling through a string and testing for set
> membership using an array.
>
> {{{#!php
> <?php
> preg_replace(
>         '~&([^;]+;)~',
>         function ( $name ) use ( $html_entities ) {
>                 $allowed = in_array( $name[0], $html_entities, true ) );
>                 return $allowed ? $name : "&{$name[1]}";
>         }
> );
> }}}
>
> This can work if the tokens follow a searchable pattern, but it's hard to
> apply complicated rules for the token pattern with the regular expression
> syntax, and this method also suffers from slow set-membership checking in
> the lookup array. For small arrays it's fine, but  not for large ones,
> especially if the document contains a high number of the input tokens.
>
> ----
>
> All of these existing methods work best when it's clear that a given
> string contains one of the tokens, but what if it's not certain? The
> performance characteristics of the array-based approaches are worst when
> the document is missing the input tokens. These approaches are memory
> heavy and inefficient computationally.
>
> ----
>
> == A different approach.
>
> I'm proposing a new class whose semantic is a relatively static mapping
> of terms or tokens to replacement strings. This class provides a number
> of purpose-specific methods supporting token-replacement operations,
> including:
>
> - Fast set-membership testing, to determine if a given string is one of
> the tokens.
> - Fast mapping from a given token to its replacement value.
> - Determining if one of the tokens exists at a specific offset, and if
> so, what it is and what its mapping is.
>
> {{{#!php
> <?php
> $named_character_references = WP_Token_Map::from_array( [ 'notin' => '∉',
> 'notin;' => '∉', 'amp;' => '&', … ] );
>
> $matched_string_snippet = '&notin';
> if ( $named_character_references->contains( substr(
> $matched_string_snippet, 1 ) ) ) {
>         // "&notin" is a valid token
> }
>
> // Replace all named character references with their replacements.
> while ( true ) {
>         $was_at = $at;
>         $at = strpos( $text, '&', $at );
>         if ( false === $at ) {
>                 $output .= substr( $text, $was_at )
>                 break;
>         }
>
>         $name_length = 0;
>         $replacement = $named_character_reference->read_token( $text,
> $at, $name_length );
>         if ( false !== $replacement ) {
>                 $output .= substr( $text, $was_at, $at - $was_at );
>                 $output .= $replacement;
>                 $at     += $name_length;
>                 continue;
>         }
>
>         // No named character reference was found, continue searching.
>         ++$at;
> }
> }}}
>
> In this example the leading `&` has been pruned from the lookup table to
> save space and time, but it very well could be added into the table.
>
> Because this class introduces specific semantic meaning, that is, looking
> up and mapping static string tokens to static string replacements, we can
> employ a number of optimizations to reduce the overall memory footprint
> and also to optimize the runtime. Such optimizations have been
> incorporated into the linked PR.
>
> ----
>
> In [https://github.com/WordPress/wordpress-develop/pull/6387 #6387] I
> have built a spec-compliant HTML text decoder which utilizes this token
> map. The performance of the new decoder is approximately 20% slower than
> calling `html_entity_decode()` directly, except it properly decodes what
> PHP can't. In fact, the performance bottleneck in that work is not even
> in the token map, but in converting a sequence of digits into a UTF-8
> string from the given code point.
>
> My proposal is adding a new class `WP_Token_Map` providing at least two
> methods for normal use:
>
>  - `contains( $token )` returns whether the passed string is in the set.
>  - `read_token( $text, $offset = 0, $skip_bytes )` indicates if the
> character sequence starting at the given offset in the passed string
> forms a token in the set, and if so, returns the replacement for that
> token. It also sets `&$skip_bytes` to the length of the token so that
> calling code .
>
> It also provides utility functions for pre-computing these classes, as
> they are designed for relatively-static cases where the actual code is
> intended to be generated dynamically, but stay static over time. For
> example, HTML5 defines the set of named character references and
> indicates that the list //shall not// change or be expanded.
> [https://html.spec.whatwg.org/#named-character-references-table HTML5
> spec]. Precomputing can save on the startup-up cost of building the
> optimized lookup tables.
>
>  - `static::from_array( array $mappings )` generates a new token map from
> the given array of whose keys are tokens and whose values are the
> replacements.
>  - `to_array()` dumps the set of mapping into an array suitable for
> passing back into `from_array()`.
>  - `static::from_precomputed_table( ...$table )` instantiates a token set
> from a precomputed table, skipping the computation for building the table
> and sorting the tokens.
>  - `precomputed_php_source_table()` generates PHP source code which can
> be loaded with the previous static method for maintenance of the core
> static token sets.
>
> == Other potential uses
>
>  - Converting smilies like `:)`
>  - Converting emoji sequences like `:happy:`
>  - Determining if a given verb/action is allowed in an API call.

New description:

 Motivated by the need to properly transform HTML named character
 references (like `&`) I found the need for a new semantic class which
 can efficiently perform search and replacement of a set of static tokens.
 Existing patterns in the codebase are not sufficient for the HTML need,
 and I suspect there are other use-cases where this class would help.

 It may be helpful to review the dev note draft before reviewing this work:
 https://make.wordpress.org/core/?p=113042&preview=1&_ppp=fbe827d6b7

 == How does WordPress typically perform this kind of substitution?

 === An associative array with a generated regular expression callback.

 It's common to see a pattern like this, where the lookup is stored as an
 array and a regular expression is built from it.

 {{{#!php
 <?php
 $smilies = array(
   ':(' => "\xf0\x9f\x99\x81",
   ':)' => "\xf0\x9f\x99\x82",
   ':?' => "\xf0\x9f\x98\x95",
   ':D' => "\xf0\x9f\x98\x80",
 );

 preg_replace_callback(
         '/' . implode( '|', array_keys( $smilies ) ) . '/',
         function ( $smily ) use ( $smilies ) { return $smilies[ $smily[0]
 ]; }
 );
 }}}

 This method constructs a large regular expression pattern on every request
 and it's not always clear what performance implications there might be
 with the long list of alternation groups. These can lead to some
 cataclysmic runtimes and the code surrounding these to setup the regular
 expression pattern and replacments tend to appear somewhat exotic, masking
 the intent of the original goal of the code.

 === Calls to `str_replace()` with two arrays.

 {{{#!php
 <?php
 $chars = array(
         'ª' => 'a',
         'º' => 'o',
         'À' => 'A',
         'Á' => 'A',
         'Â' => 'A',
         'Ã' => 'A',
         'Ä' => 'A',
 );

 str_replace( array_keys( $chars ), array_values( $chars ), $text );
 }}}

 This method is clear in the code, but the runtime can be very inefficient
 due to the fact that it's required to iterate over all of the array keys
 until a string match is found. Like the regular expression, performance
 can degrade quickly when there are hundreds or thousands of replacements.

 === Scanning with array lookup.

 Sometimes code is crawling through a string and testing for set membership
 using an array.

 {{{#!php
 <?php
 preg_replace(
         '~&([^;]+;)~',
         function ( $name ) use ( $html_entities ) {
                 $allowed = in_array( $name[0], $html_entities, true ) );
                 return $allowed ? $name : "&{$name[1]}";
         }
 );
 }}}

 This can work if the tokens follow a searchable pattern, but it's hard to
 apply complicated rules for the token pattern with the regular expression
 syntax, and this method also suffers from slow set-membership checking in
 the lookup array. For small arrays it's fine, but  not for large ones,
 especially if the document contains a high number of the input tokens.

 ----

 All of these existing methods work best when it's clear that a given
 string contains one of the tokens, but what if it's not certain? The
 performance characteristics of the array-based approaches are worst when
 the document is missing the input tokens. These approaches are memory
 heavy and inefficient computationally.

 ----

 == A different approach.

 I'm proposing a new class whose semantic is a relatively static mapping of
 terms or tokens to replacement strings. This class provides a number of
 purpose-specific methods supporting token-replacement operations,
 including:

 - Fast set-membership testing, to determine if a given string is one of
 the tokens.
 - Fast mapping from a given token to its replacement value.
 - Determining if one of the tokens exists at a specific offset, and if so,
 what it is and what its mapping is.

 {{{#!php
 <?php
 $named_character_references = WP_Token_Map::from_array( [ 'notin' => '∉',
 'notin;' => '∉', 'amp;' => '&', … ] );

 $matched_string_snippet = '&notin';
 if ( $named_character_references->contains( substr(
 $matched_string_snippet, 1 ) ) ) {
         // "&notin" is a valid token
 }

 // Replace all named character references with their replacements.
 while ( true ) {
         $was_at = $at;
         $at = strpos( $text, '&', $at );
         if ( false === $at ) {
                 $output .= substr( $text, $was_at )
                 break;
         }

         $name_length = 0;
         $replacement = $named_character_reference->read_token( $text, $at,
 $name_length );
         if ( false !== $replacement ) {
                 $output .= substr( $text, $was_at, $at - $was_at );
                 $output .= $replacement;
                 $at     += $name_length;
                 continue;
         }

         // No named character reference was found, continue searching.
         ++$at;
 }
 }}}

 In this example the leading `&` has been pruned from the lookup table to
 save space and time, but it very well could be added into the table.

 Because this class introduces specific semantic meaning, that is, looking
 up and mapping static string tokens to static string replacements, we can
 employ a number of optimizations to reduce the overall memory footprint
 and also to optimize the runtime. Such optimizations have been
 incorporated into the linked PR.

 ----

 In [https://github.com/WordPress/wordpress-develop/pull/6387 #6387] I have
 built a spec-compliant HTML text decoder which utilizes this token map.
 The performance of the new decoder is approximately 20% slower than
 calling `html_entity_decode()` directly, except it properly decodes what
 PHP can't. In fact, the performance bottleneck in that work is not even in
 the token map, but in converting a sequence of digits into a UTF-8 string
 from the given code point.

 My proposal is adding a new class `WP_Token_Map` providing at least two
 methods for normal use:

  - `contains( $token )` returns whether the passed string is in the set.
  - `read_token( $text, $offset = 0, $skip_bytes )` indicates if the
 character sequence starting at the given offset in the passed string forms
 a token in the set, and if so, returns the replacement for that token. It
 also sets `&$skip_bytes` to the length of the token so that calling code .

 It also provides utility functions for pre-computing these classes, as
 they are designed for relatively-static cases where the actual code is
 intended to be generated dynamically, but stay static over time. For
 example, HTML5 defines the set of named character references and indicates
 that the list //shall not// change or be expanded.
 [https://html.spec.whatwg.org/#named-character-references-table HTML5
 spec]. Precomputing can save on the startup-up cost of building the
 optimized lookup tables.

  - `static::from_array( array $mappings )` generates a new token map from
 the given array of whose keys are tokens and whose values are the
 replacements.
  - `to_array()` dumps the set of mapping into an array suitable for
 passing back into `from_array()`.
  - `static::from_precomputed_table( ...$table )` instantiates a token set
 from a precomputed table, skipping the computation for building the table
 and sorting the tokens.
  - `precomputed_php_source_table()` generates PHP source code which can be
 loaded with the previous static method for maintenance of the core static
 token sets.

 == Other potential uses

  - Converting smilies like `:)`
  - Converting emoji sequences like `:happy:`
  - Determining if a given verb/action is allowed in an API call.

--

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/60698#comment:28>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform