[wp-trac] [WordPress Trac] #56294: WordPress search finds block name in comment

WordPress Trac noreply at wordpress.org
Wed Jun 26 00:35:05 UTC 2024


#56294: WordPress search finds block name in comment
--------------------------------------+--------------------------
 Reporter:  zodiac1978                |       Owner:  (none)
     Type:  enhancement               |      Status:  closed
 Priority:  normal                    |   Milestone:
Component:  Database                  |     Version:  5.0
 Severity:  normal                    |  Resolution:  maybelater
 Keywords:  needs-patch dev-feedback  |     Focuses:  performance
--------------------------------------+--------------------------

Comment (by dmsnell):

 What an interesting and exciting challenge to solve. I'll share some of my
 own thoughts, having worked on search indexing in different platforms and
 having worked on the serialized block HTML at all levels.

 Concerning the use of functions like `REGEXP_REPLACE`, I really caution
 folks to consider what those imply for the database when performing a
 search. They end up parsing and transforming every row on every search.
 For small test sites this probably never amounts to much, but on a site
 with thousands of posts and thousands of daily visitors, this could
 rapidly overwhelm the database. I don't see //computing the search index
 on every search query// as a viable option.

 That being said, storing a kind of transformed post in another location,
 as has been discussed, would make searching easier with the existing
 toolsets and performance characteristics. Post meta is a convenient
 approach, but may not be ideal for similar performance reasons. The
 transformed text could instead be stored in an additional column on the
 post row in the database, or in a separate database table just for post
 indexes.
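
 As a rough sketch of the "compute once at write time" idea, the following
 plain-PHP example keeps an in-memory map standing in for a dedicated index
 table. The class `Plaintext_Index`, its methods, and the `strip_tags()`
 based transform are all assumptions made for illustration, not WordPress
 APIs; the point is only that the transform runs when a post is saved and
 never when it is searched.

```php
<?php
// Hypothetical in-memory stand-in for a separate "post index" table.
final class Plaintext_Index {
	/** @var array<int, string> Post ID => lowercased plaintext. */
	private $rows = array();

	/** @var int How many times the (potentially costly) transform ran. */
	public $transform_count = 0;

	// Called when a post is written: transform once and store the result.
	public function on_save( int $post_id, string $post_content ): void {
		$this->rows[ $post_id ] = $this->to_plaintext( $post_content );
	}

	// Called on every search: only compares against stored text.
	public function search( string $term ): array {
		$needle = strtolower( $term );
		if ( '' === $needle ) {
			return array();
		}

		$hits = array();
		foreach ( $this->rows as $post_id => $text ) {
			if ( false !== strpos( $text, $needle ) ) {
				$hits[] = $post_id;
			}
		}
		return $hits;
	}

	// Crude placeholder transform (an assumption, not the HTML API).
	private function to_plaintext( string $html ): string {
		$this->transform_count++;
		$text = html_entity_decode( strip_tags( $html ), ENT_QUOTES | ENT_HTML5, 'UTF-8' );
		return strtolower( $text );
	}
}

// Made-up sample posts.
$index = new Plaintext_Index();
$index->on_save( 1, '<p>Indexing posts at save time</p>' );
$index->on_save( 2, '<p>Something else entirely</p>' );

$first  = $index->search( 'save time' );
$second = $index->search( 'entirely' );
$third  = $index->search( 'missing' );
```

 However many searches run, `$index->transform_count` stays at the number
 of saves, which is the property a stored index is meant to provide.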

 The HTML API finally provides powerful and reliable tools for searching
 the rendered or text content of a post. For example, WordPress could store
 a plaintext view over a post every time it updates a post, and searches
 can be performed against that. This not only would work around the
 challenges posed by the block content, but also the very same challenges
 which have always existed within WordPress' search. For example, it's
 always been the case that if you search for `form` or `code` or `template`
 and a post contains those tags, the search returns that post as a false
 positive.
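
 To make the false-positive problem concrete, here is a small plain-PHP
 sketch. The sample content is made up, `rough_plaintext()` is a
 hypothetical helper, and `strip_tags()` is used only as a crude stand-in
 for a proper text extraction.

```php
<?php
// Hypothetical helper: a crude plaintext view of stored post content.
// strip_tags() also removes HTML comments, which drops block delimiters.
function rough_plaintext( string $html ): string {
	$text = strip_tags( $html );
	return html_entity_decode( $text, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
}

// Made-up sample of serialized block content.
$post_content = '<!-- wp:code -->' . "\n"
	. '<pre class="wp-block-code"><code>echo &quot;hi&quot;;</code></pre>' . "\n"
	. '<!-- /wp:code -->';

// Searching the raw markup matches the block name and the tag name.
$raw_hit = false !== stripos( $post_content, 'code' );

// Searching the plaintext view only matches what a reader would see.
$text_hit = false !== stripos( rough_plaintext( $post_content ), 'code' );
$echo_hit = false !== stripos( rough_plaintext( $post_content ), 'echo "hi"' );
```

 Here the raw search for `code` matches even though the word never appears
 in the visible text, while the plaintext search does not, and the decoded
 `echo "hi"` is still found.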

 Generating the text content has never been easier, and because the HTML
 API always decodes character references before presenting string values
 to calling PHP code, it also gets around unexpected constructions
 involving them.

 {{{#!php
 <?php
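 /**
  * Returns the concatenated text nodes of the given HTML, with
  * character references already decoded by the HTML API.
  */
 function get_text_content( $html ) {
         $text_content = '';
         $processor    = new WP_HTML_Tag_Processor( $html );

         // Visit every token, not only tags, so text nodes are seen.
         while ( $processor->next_token() ) {
                 // Tags and comments (including block delimiters) are skipped.
                 if ( '#text' === $processor->get_token_name() ) {
                         $text_content .= $processor->get_modifiable_text();
                 }
         }

         return $text_content;
 }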
 }}}

 Upon this foundation all sorts of stronger search indices can be built and
 then searched.

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/56294#comment:20>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

