[wp-trac] [WordPress Trac] #56294: WordPress search finds block name in comment

WordPress Trac noreply at wordpress.org
Mon Apr 3 14:42:34 UTC 2023


#56294: WordPress search finds block name in comment
--------------------------------------+------------------------------
 Reporter:  zodiac1978                |       Owner:  (none)
     Type:  enhancement               |      Status:  closed
 Priority:  normal                    |   Milestone:  Awaiting Review
Component:  Database                  |     Version:  5.0
 Severity:  normal                    |  Resolution:  maybelater
 Keywords:  needs-patch dev-feedback  |     Focuses:  performance
--------------------------------------+------------------------------

Comment (by l1nuxjedi):

 Replying to [comment:13 zodiac1978]:
 > Replying to [comment:12 l1nuxjedi]:
 > > In fact the first example here shows how to modify that DB query to
 filter all meta tags: https://mariadb.com/kb/en/regexp_replace/
 >
 > The problem with parsing HTML with RegEx is that there are so many edge
 cases that will break it. For example, a single "<" in the content filters
 out everything up to the next closing ">" ...
 >
 > See: https://stackoverflow.com/a/1732454
 >
 > And with the example from the MariaDB knowledge base:
 https://regex101.com/r/CY0zuJ/1 (just added "5 < 1" in the content).
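
 The edge case from the quoted comment can be reproduced outside the
 database. This is a minimal sketch in Python. It assumes a non-greedy tag
 pattern of the form <.+?>, which is a guess at the kind of pattern being
 discussed, not a quote from the linked MariaDB page. The function name
 strip_tags_naive is made up for illustration.

```python
import re

# Assumed pattern: non-greedy "anything between < and >".
TAG_PATTERN = re.compile(r"<.+?>")


def strip_tags_naive(content):
    """Replace everything that looks like a tag with a space."""
    without_tags = TAG_PATTERN.sub(" ", content)
    # Collapse the whitespace left behind by the removed tags.
    return " ".join(without_tags.split())


# Well-formed markup behaves as expected.
print(strip_tags_naive("<p>Hello world</p>"))  # Hello world

# A literal "<" in the text starts a bogus "tag" that runs to the next ">",
# so "1 is false" is removed together with the closing </p>.
print(strip_tags_naive("<p>5 < 1 is false</p><p>Hello</p>"))  # 5 Hello
```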

 I would hope that in that case literal < and > are encoded as HTML
 entities. Otherwise, the only real way to avoid edge cases would be to use
 a proper HTML parser to extract the raw text and feed that text into
 index-generating code for the search.
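
 As a rough illustration of the parser approach, here is a sketch using
 Python's standard html.parser module. The class TextExtractor and the
 function extract_text are hypothetical names, and the sample content is
 only an assumption about what stored block markup looks like. This is not
 how WordPress search works today.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes only; tags and comments are dropped."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; into "<".
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    # handle_comment is not overridden, so the base class discards
    # comments, including block delimiters like <!-- wp:paragraph -->.


def extract_text(html):
    """Return the plain text of an HTML fragment, whitespace-normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


content = (
    "<!-- wp:paragraph -->"
    "<p>Compare 5 &lt; 1 and 2 &gt; 1</p>"
    "<!-- /wp:paragraph -->"
)
print(extract_text(content))  # Compare 5 < 1 and 2 > 1
```

 The extracted text, not the stored markup, would then be what goes into
 the search index, so a search for "paragraph" no longer matches the block
 name in the comment.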

 Another option would be to use a dedicated search engine such as Sphinx
 as the backend instead, which can intelligently filter out meta tags. It
 does not have to be that one, though. It has been over 15 years since I
 built something similar to what you are trying to achieve here, and I'm
 sure the technology has moved on since then.

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/56294#comment:16>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

