[wp-trac] [WordPress Trac] #56294: WordPress search finds block name in comment
WordPress Trac
noreply at wordpress.org
Wed Jun 26 00:35:05 UTC 2024
#56294: WordPress search finds block name in comment
--------------------------------------+--------------------------
Reporter: zodiac1978 | Owner: (none)
Type: enhancement | Status: closed
Priority: normal | Milestone:
Component: Database | Version: 5.0
Severity: normal | Resolution: maybelater
Keywords: needs-patch dev-feedback | Focuses: performance
--------------------------------------+--------------------------
Comment (by dmsnell):
What an interesting and exciting challenge to solve. I'll share some of my
own thoughts, having worked on search indexing in different platforms and
having worked on the serialized block HTML at all levels.
Concerning the use of functions like `REGEXP_REPLACE`, I'd really caution
folks to consider what those imply for the database when performing
a search. They end up parsing and modifying every row on every search. For
small test sites this probably never amounts to much, but on a
site with thousands of posts and thousands of daily visitors, this could
rapidly overwhelm the database. I don't see //computing the search index
on every search query// as a viable option.
That being said, storing a kind of transformed post in another location,
as discussed in this ticket, would make searching easier with the existing
toolsets and performance characteristics. Post meta is a convenient approach,
but may not be ideal, for similar performance reasons. The transformed text
could instead be computed as an additional column on the post row in the
database, or stored in a separate database table just for post indexes.
The HTML API finally provides powerful and reliable tools for searching
the rendered or text content of a post. For example, WordPress could store
a plaintext view over a post every time it updates a post, and searches
can be performed against that. This not only would work around the
challenges posed by the block content, but also the very same challenges
which have always existed within WordPress' search. For example, it's
always been the case that if you search for `form` or `code` or `template`
and a post contains those tags, that the search will return those false
results.
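To make the false-positive problem concrete, here is a minimal,
self-contained sketch in plain PHP. It assumes a naive substring match
stands in for the `LIKE` comparison that WordPress search performs, and it
uses `strip_tags()` with `html_entity_decode()` only as a crude placeholder
for real text extraction so that it runs outside WordPress. The
`naive_match()` helper and the sample content are hypothetical.
{{{#!php
<?php
// Serialized block markup, roughly as it is stored in post_content.
$post_content = '<!-- wp:paragraph --><p>Fish &amp; chips</p><!-- /wp:paragraph -->';

// Crude stand-in for proper text extraction (assumption: NOT the HTML API).
// strip_tags() drops tags and HTML comments; the entities are then decoded.
$plain_text = html_entity_decode( strip_tags( $post_content ), ENT_QUOTES | ENT_HTML5, 'UTF-8' );

// Hypothetical helper: case-insensitive substring match, similar in spirit
// to a LIKE '%term%' comparison.
function naive_match( $haystack, $needle ) {
	return false !== stripos( $haystack, $needle );
}
}}}
With these definitions, `naive_match( $post_content, 'paragraph' )` is true
because the term appears in the block delimiter, while
`naive_match( $plain_text, 'paragraph' )` is false.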
Generating the text content has never been easier, and because of the HTML
API interface it also gets around unexpected constructions involving
character references, as it always decodes them before presenting the
string values to calling PHP code.
{{{#!php
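<?php
/**
 * Returns the decoded plain-text content of an HTML string.
 */
function get_text_content( $html ) {
	$text_content = '';
	$processor    = new WP_HTML_Tag_Processor( $html );

	// Visit every token (tags, comments, text nodes), not only tags.
	while ( $processor->next_token() ) {
		// Keep text nodes only; block delimiters are HTML comments and are skipped.
		if ( '#text' === $processor->get_token_name() ) {
			// Character references arrive already decoded.
			$text_content .= $processor->get_modifiable_text();
		}
	}

	return $text_content;
}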
}}}
Upon this foundation all sorts of stronger search indices can be built and
then searched.
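As one possible starting point, assuming the `get_text_content()` function
above is loaded, a plain-text copy could be refreshed whenever a post is
saved. The sketch below stores it in post meta only because that is the
shortest thing to show; as noted earlier, a dedicated column or table may
perform better. The meta key `_search_text_content` and the function name
`store_search_text_content()` are made up for illustration.
{{{#!php
<?php
/**
 * Stores a plain-text copy of the post content for later searching.
 * Sketch only: the meta key is hypothetical, not part of WordPress core.
 */
function store_search_text_content( $post_id, $post ) {
	// Skip revisions and autosaves so only the post itself is indexed.
	if ( wp_is_post_revision( $post_id ) || wp_is_post_autosave( $post_id ) ) {
		return;
	}

	$text = get_text_content( $post->post_content );

	// update_post_meta() unslashes its value, so slash it first
	// to preserve any backslashes in the text.
	update_post_meta( $post_id, '_search_text_content', wp_slash( $text ) );
}
add_action( 'save_post', 'store_search_text_content', 10, 2 );
}}}
A search would then need to be pointed at the stored text instead of
`post_content`, which this sketch does not attempt.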
--
Ticket URL: <https://core.trac.wordpress.org/ticket/56294#comment:20>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform