[wp-trac] [WordPress Trac] #56294: WordPress search finds block name in comment
WordPress Trac
noreply at wordpress.org
Mon Apr 3 14:42:34 UTC 2023
#56294: WordPress search finds block name in comment
--------------------------------------+------------------------------
Reporter: zodiac1978 | Owner: (none)
Type: enhancement | Status: closed
Priority: normal | Milestone: Awaiting Review
Component: Database | Version: 5.0
Severity: normal | Resolution: maybelater
Keywords: needs-patch dev-feedback | Focuses: performance
--------------------------------------+------------------------------
Comment (by l1nuxjedi):
Replying to [comment:13 zodiac1978]:
> Replying to [comment:12 l1nuxjedi]:
> > In fact the first example here shows how to modify that DB query to
> > filter all meta tags: https://mariadb.com/kb/en/regexp_replace/
>
> The problem with parsing HTML with RegEx is that there are so many edge
> cases that will break it. For example, a single "<" in the content filters
> out everything up to the next closing ">".
>
> See: https://stackoverflow.com/a/1732454
>
> And with the example from the MariaDB knowledge base:
> https://regex101.com/r/CY0zuJ/1 (just added "5 < 1" in the content).
I would hope that in that case literal < and > are encoded as HTML
entities. Otherwise, the only real way to avoid edge cases would be a
proper HTML parser that extracts the raw text, which is then fed into
index-generating code for the search to use.
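As a rough illustration of the difference, here is a minimal Python sketch
(not WordPress code). It assumes post content where literal < and > are
entity-encoded, and uses the standard library's html.parser to keep only
text nodes. The names TextExtractor, extract_text and strip_tags_regex are
made up for this example, and the regex is only a naive pattern in the
spirit of the REGEXP_REPLACE approach, not a copy of any real query.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, drops tags and comments."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &lt; back into "<"
        # before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    # handle_comment keeps its default no-op behaviour, so block markers
    # such as <!-- wp:paragraph --> never reach the output.


def extract_text(html):
    """Return the visible text of an HTML fragment, whitespace-normalised."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def strip_tags_regex(html):
    """Naive tag stripping, shown only to reproduce the edge case."""
    return re.sub(r"<.+?>", "", html)


if __name__ == "__main__":
    encoded = "<!-- wp:paragraph --><p>5 &lt; 1 is false</p><!-- /wp:paragraph -->"
    unencoded = "<p>5 < 1 is false</p>"

    # Parser on entity-encoded content: block comment and tags are gone,
    # the text survives intact.
    print(repr(extract_text(encoded)))

    # Regex on content with a literal "<": everything from that "<" to the
    # next ">" is swallowed.
    print(repr(strip_tags_regex(unencoded)))
```

The text returned by extract_text is what would be handed to whatever
builds the search index, so the block name in the comment is never indexed.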
Another suggestion would be to use a search engine such as Sphinx as the
backend instead, which can intelligently filter out meta tags. It need not
be that one, though: it has been over 15 years since I built something
similar to what you are trying to achieve here, and I'm sure the
technology has moved on since then.
--
Ticket URL: <https://core.trac.wordpress.org/ticket/56294#comment:16>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform