[wp-meta] [Making WordPress.org] #1692: Plugin search quality improvements
Making WordPress.org
noreply at wordpress.org
Fri May 6 20:20:15 UTC 2016
#1692: Plugin search quality improvements
------------------------------+---------------------------------------
Reporter: tellyworth | Owner:
Type: enhancement | Status: new
Priority: normal | Milestone: Plugin Directory v3 - M3
Component: Plugin Directory | Resolution:
Keywords: |
------------------------------+---------------------------------------
Comment (by gibrown):
I've built a custom Elasticsearch index for the plugins which does a
better job of indexing a lot of the data that was previously only
available in meta fields. This index is also running on our Elasticsearch
2.3 cluster which means that all of the newer aggregations and query
options should be available (previous index was on 1.3).
I've disabled the index that @tellyworth was using before and enabled this
one. It's backwards compatible, so the currently deployed test search query
is still working.
The index building code and full mapping can be seen in this gist:
[https://gist.github.com/gibrown/d4750aa773154948c81791fe18bdc521]. It
relies on the wpes-lib framework
([https://github.com/automattic/wpes-lib]).
Some details on how we are currently indexing content:
- We have a separate field for each custom language analyzer we have
configured. There are currently 29 language analyzers: ar, bg, ca, cs, da,
de, el, en, es, eu, fa, fi, fr, he, hi, hu, hy, id, it, ja, ko, nl, no,
pt, ro, ru, sv, tr, zh
- All of the analyzed text fields ('content', 'title', 'excerpt', and
'upgrade_notice') have an associated field for each of these analyzers (eg
'content_es').
- If a language doesn't have a custom language analyzer, then it should
use the default field (eg 'content') which tries to use some reasonable
defaults.
- Right now the non-English fields are being populated by looking for an
associated meta value (eg content for Spanish looks for the meta key
'content_es'). An open question is whether we should instead reindex
nightly and query GlotPress for the translation.
- For every content field (and a few others) we also have an _ngram
version of the field
- ngrams take an individual token (like 'wordpress') and store it as
multiple character sequences (eg 2+3 grams: 'wo', 'wor', 'or', 'ord',
'rd', 'rdp', 'dp', 'dpr', 'pr', 'pre', 're', 'res', 'es', 'ess', 'ss')
- ngrams should enable us to build very fast and relevant instant search
results so that a user never has to hit enter: simply run a query any time
the user pauses for more than 100-200 ms.
- There is not anything special that the client needs to do for ngrams or
language analysis. It just needs to run a multi_match query on the
appropriate fields.
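As a rough illustration of the two indexing conventions above (not the
actual index-building code, which lives in the gist), the per-language field
naming and the 2+3 gram expansion can be sketched in Python. The helper names
`field_for` and `char_ngrams` are made up for this example, and the real
Elasticsearch ngram tokenizer may emit its grams in a different order.

```python
# Languages that have a custom analyzer, as listed above.
ANALYZED_LANGS = {
    "ar", "bg", "ca", "cs", "da", "de", "el", "en", "es", "eu",
    "fa", "fi", "fr", "he", "hi", "hu", "hy", "id", "it", "ja",
    "ko", "nl", "no", "pt", "ro", "ru", "sv", "tr", "zh",
}


def field_for(base, lang):
    """Return the per-language field (eg 'content_es'), or the default
    field (eg 'content') when the language has no custom analyzer."""
    return f"{base}_{lang}" if lang in ANALYZED_LANGS else base


def char_ngrams(token, sizes=(2, 3)):
    """Expand one token into its character n-grams, position by position."""
    grams = []
    for start in range(len(token)):
        for size in sizes:
            # Only keep grams that fit entirely inside the token.
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams


print(field_for("content", "es"))
print(field_for("content", "th"))  # no custom analyzer: default field
print(char_ngrams("wordpress"))
```

Because every prefix of a word shares grams with the full word, a partially
typed query like 'wordpr' already overlaps heavily with 'wordpress', which is
what makes the _ngram fields useful for instant search.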
Some fields I want to highlight and how I think they could be used for
improving search results:
- Obviously all of the content fields should be searched with a
multi_match query. Adding phrase queries to the query should also help.
In the example below, the "cross_fields" clause enables matching terms in
different fields (eg "Matt Hello Dolly"), and the "phrase" clause treats
the whole query as a phrase. Example:
{{{
{ "query" : { "bool" : {
  "must" : [
    { "multi_match" : {
      "type" : "cross_fields",
      "fields" : ["content_en","title_en^2","excerpt_en","upgrade_notice_en^0.5","slug_ngram","header_author","contributors"],
      "query" : "Matt Hello Dolly"
    } }
  ],
  "should" : [
    { "multi_match" : {
      "type" : "phrase",
      "fields" : ["content_en","title_en^2","excerpt_en","upgrade_notice_en^0.5","slug_ngram","header_author","contributors"],
      "query" : "Matt Hello Dolly"
    } }
  ]
} } }
}}}
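For clients that build the request in code, the same must/should structure
can be assembled programmatically. The following Python sketch is only an
illustration: `build_search_query` is a hypothetical helper, it assumes the
requested language is one of those with a custom analyzer (so the `_<lang>`
field suffix exists), and it only builds the request body; sending it to the
cluster is left out.

```python
import json


def build_search_query(text, lang="en"):
    """Build a bool query: a cross_fields match that must hit, plus a
    phrase match that only boosts documents containing the exact phrase."""
    fields = [
        f"content_{lang}",
        f"title_{lang}^2",
        f"excerpt_{lang}",
        f"upgrade_notice_{lang}^0.5",
        "slug_ngram",
        "header_author",
        "contributors",
    ]

    def multi_match(match_type):
        # Each clause is its own object inside the must/should arrays.
        return {
            "multi_match": {
                "type": match_type,
                "fields": fields,
                "query": text,
            }
        }

    return {
        "query": {
            "bool": {
                "must": [multi_match("cross_fields")],
                "should": [multi_match("phrase")],
            }
        }
    }


print(json.dumps(build_search_query("Matt Hello Dolly"), indent=2))
```

Building the body from a dict and serializing it with `json.dumps` avoids the
hand-written JSON pitfalls (trailing commas, inline comments) that
Elasticsearch would reject.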
- number_of_translations : Counts the number of fields translated for each
language (translating content, title, excerpt, and upgrade_notice to one
language will get 4 points). We should rank translated plugins higher.
This should be a good signal of quality and would encourage plugin authors
to translate plugins. Just multiply in a log1p field_value_factor scoring.
- tested : latest WP version as a float. Higher is always better. Just
multiply in a log1p field_value_factor scoring.
- required : potentially could use this as a signal of how long a plugin
has been around. But easy to game.
- stable_tag : not sure if this is useful :)
- tagged_versions : unsure if useful
- number_of_versions : more tags (maybe) means it has been supported for a
while. Would be better if we had some dates when the tags happened
- percent_on_stable : based on meta.usage and the meta.stable_tag this is
(roughly) percentage of users who trust this plugin enough to upgrade.
Could be used in a 'script_score' to adjust number of active installs or
in a log1p field_value_factor scoring.
- active_installs : obviously in log1p field_value_factor scoring.
- support_resolution_yes, support_resolution_no, support_resolution_mu :
(I'm not sure about definition of this compared to support threads)
- support_resolution_percentage : support_resolution_yes /
(support_resolution_yes + support_resolution_no + support_resolution_mu)
: log1p field_value_factor scoring
- support_threads : log1p field_value_factor scoring
- support_threads_resolved : log1p field_value_factor scoring
- support_threads_percentage : support_threads_resolved / support_threads
: log1p field_value_factor scoring
- contributors_active_installs : the sum of active installs across all
plugins by all of this plugin's contributors. Should be a great signal.
Example: https://wordpress.org/plugins/slug-control/ by Mark Jaquith has
30 active installs. Because of who wrote it I have no doubt it is a great
plugin for that use case, but 99% of users don't know that. Total active
installs by the contributors should be a good proxy for this.
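To make the "multiply in a log1p field_value_factor" idea concrete, here is a
hedged Python sketch that wraps a text query in a function_score query. The
choice of signals, the "missing" default, and the helper names
`with_quality_signals`, `log1p_factor` and `safe_ratio` are assumptions for
illustration only, not a tuned or agreed-upon ranking. The log1p modifier
computes log10(1 + value), which is why very large install counts get
dampened.

```python
import math


def safe_ratio(numerator, denominator):
    """Ratio such as support_threads_resolved / support_threads.
    Returns 0.0 when there is no data instead of dividing by zero."""
    return numerator / denominator if denominator else 0.0


def log1p_factor(value):
    """What the 'log1p' modifier contributes for one field value."""
    return math.log10(1 + value)


def with_quality_signals(text_query, signal_fields):
    """Wrap a query so each signal multiplies the text relevance score."""
    functions = [
        {
            "field_value_factor": {
                "field": field,
                "modifier": "log1p",
                "missing": 1,  # assumed default when a plugin lacks the field
            }
        }
        for field in signal_fields
    ]
    return {
        "query": {
            "function_score": {
                "query": text_query,
                "functions": functions,
                "score_mode": "multiply",  # combine the signals together
                "boost_mode": "multiply",  # then multiply into the text score
            }
        }
    }


body = with_quality_signals(
    {"match": {"title_en": "related posts"}},
    ["active_installs", "contributors_active_installs", "tested"],
)
print(body["query"]["function_score"]["functions"][0])
print(log1p_factor(30), log1p_factor(300000))
print(safe_ratio(3, 3 + 1 + 0))  # eg yes / (yes + no + mu)
```

With this modifier, 30 installs contribute a factor of about 1.49 and 300k
installs about 5.48, so popularity still matters without completely swamping
text relevance. One caveat: a field value of 0 yields a factor of 0, which
zeroes the whole score when multiplying, so a different modifier (eg log2p)
or a floor on the value may be needed for signals that can legitimately be 0.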
Open issues (I think):
- Should we bulk reindex this index daily and query GlotPress to get the
latest translations? Can someone provide me with the PHP code for doing
that?
- We should probably have a complete whitelist of supported langs and have
a field for each with the appropriately configured lang analyzer? Is there
a full list?
- What should the full query be when doing instant search?
- What should the full query be when doing non-instant search?
- It might be nice to add some meta fields with actual user names of the
contributors to search against.
- Related to @tellyworth's point about generic terms: we should not weight
an exact name match as highly as we have been in the past. I think just
boosting the title a bit will do that, but it needs some experimentation.
I'm biased with this example, but let's look at
[https://wordpress.org/plugins/search.php?q=related+posts]. The top
recommendation has 40k installs. YARPP at #2 has 300k. Jetpack is not on
the first couple of pages despite having far more installs. The currently
deployed test query though is also far too heavily weighted towards number
of installs (see [https://cloudup.com/cZK8ZPNUE5p]). Those are pretty
terrible results. We need to find some middle ground and balance for these
weightings. "SEO" is another case that is probably worth testing with.
--
Ticket URL: <https://meta.trac.wordpress.org/ticket/1692#comment:4>
Making WordPress.org <https://meta.trac.wordpress.org/>