[wp-meta] [Making WordPress.org] #1692: Plugin search quality improvements

Making WordPress.org noreply at wordpress.org
Fri May 6 20:20:15 UTC 2016


#1692: Plugin search quality improvements
------------------------------+---------------------------------------
 Reporter:  tellyworth        |       Owner:
     Type:  enhancement       |      Status:  new
 Priority:  normal            |   Milestone:  Plugin Directory v3 - M3
Component:  Plugin Directory  |  Resolution:
 Keywords:                    |
------------------------------+---------------------------------------

Comment (by gibrown):

 I've built a custom Elasticsearch index for the plugins which does a
 better job of indexing a lot of the data that was previously only
 available in meta fields. This index is also running on our Elasticsearch
 2.3 cluster which means that all of the newer aggregations and query
 options should be available (previous index was on 1.3).

 I've disabled the index that @tellyworth was using before and enabled this
 one. It's backwards compatible, so the currently deployed test search query
 is still working.

 The index building code and full mapping can be seen in this gist:
 [https://gist.github.com/gibrown/d4750aa773154948c81791fe18bdc521]. It
 relies on the wpes-lib framework
 ([https://github.com/automattic/wpes-lib]).

 Some details on how we are currently indexing content:
 - We have a separate field for each custom language analyzer we have
 configured. We currently have 29 language analyzers: ar, bg, ca, cs, da, de,
 el, en, es, eu, fa, fi, fr, he, hi, hu, hy, id, it, ja, ko, nl, no, pt,
 ro, ru, sv, tr, zh
 - All of the analyzed text fields ('content', 'title', 'excerpt', and
 'upgrade_notice') have an associated field for this analyzer (eg
 'content_es').
 - If a language doesn't have a custom language analyzer, then it should
 use the default field (eg 'content') which tries to use some reasonable
 defaults.
 - Right now the non-English fields are being populated by looking for an
 associated meta value (eg content for Spanish looks for the meta key
 'content_es'). An open question is whether we should instead reindex
 nightly and query GlotPress for the translation.
 - For every content field (and a few others) we also have an _ngram
 version of the field.
 - ngrams take an individual token (like 'wordpress') and store it as
 multiple character sequences (eg 2+3 grams: 'wo', 'wor', 'or', 'ord',
 'rd', 'rdp', 'dp', 'dpr', 'pr', 'pre', 're', 'res', 'es', 'ess', 'ss').
 - ngrams should enable us to build very fast and relevant instant search
 results so that a user never has to hit enter. Simply run a query any time
 the user seems to pause for more than 100-200 ms.
 - There is not anything special that the client needs to do for ngrams or
 language analysis. It just needs to run a multi_match query on the
 appropriate fields.
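
 To make the field naming and the ngram example above concrete, here is a
 minimal Python sketch. It is an illustration only, not the index building
 code from the gist, and not how Elasticsearch implements its ngram
 tokenizer. The helper names (field_for, char_ngrams) are hypothetical; the
 analyzer codes are the 29 listed above.

```python
# Assumption: these are the 29 analyzer codes listed in the comment above.
LANGUAGE_ANALYZERS = frozenset(
    "ar bg ca cs da de el en es eu fa fi fr he hi hu hy id it ja ko nl no pt "
    "ro ru sv tr zh".split()
)


def field_for(base, lang):
    """Return the per-language field (eg 'content_es').

    Falls back to the default field (eg 'content') when the language
    has no custom analyzer. Hypothetical helper, for illustration.
    """
    return f"{base}_{lang}" if lang in LANGUAGE_ANALYZERS else base


def char_ngrams(token, min_n=2, max_n=3):
    """Return the character n-grams of one token, ordered by start position."""
    grams = []
    for start in range(len(token)):
        for n in range(min_n, max_n + 1):
            if start + n <= len(token):
                grams.append(token[start:start + n])
    return grams


if __name__ == "__main__":
    print(field_for("content", "es"))
    print(char_ngrams("wordpress"))
```

 Calling char_ngrams("wordpress") yields the same fifteen 2+3 grams listed
 in the bullet above, in the same order.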

 Some fields I want to highlight and how I think they could be used for
 improving search results:
 - Obviously all of the content fields should be searched with a
 multi_match query. Adding phrase queries to the query should also help.
 Example:


 {{{
 { "query" : { "bool" : {
   "must" : [
     { "multi_match" : {
       // cross_fields enables matching terms in different fields
       // (eg "Matt Hello Dolly")
       "type" : "cross_fields",
       "fields" : ["content_en", "title_en^2", "excerpt_en",
                   "upgrade_notice_en^0.5", "slug_ngram",
                   "header_author", "contributors"],
       "query" : "Matt Hello Dolly"
     } }
   ],
   "should" : [
     { "multi_match" : {
       // phrase treats the whole query as a single phrase
       "type" : "phrase",
       "fields" : ["content_en", "title_en^2", "excerpt_en",
                   "upgrade_notice_en^0.5", "slug_ngram",
                   "header_author", "contributors"],
       "query" : "Matt Hello Dolly"
     } }
   ]
 } } }
 }}}

 - number_of_translations : Counts the number of fields translated for each
 language (translating content, title, excerpt, and upgrade_notice to one
 language will get 4 points). We should rank translated plugins higher.
 Should be a good signal of quality and encourages plugin authors to
 translate plugins. Just multiply in a log1p field_value_factor scoring
 - tested: latest WP version as a float. Higher is always better. Just
 multiply in a log1p field_value_factor scoring
 - required : potentially could use this as a signal of how long a plugin
 has been around. But easy to game.
 - stable_tag : not sure if this is useful :)
 - tagged_versions : unsure if useful
 - number_of_versions : more tags (maybe) means it has been supported for a
 while. Would be better if we had some dates when the tags happened
 - percent_on_stable : based on meta.usage and the meta.stable_tag this is
 (roughly) percentage of users who trust this plugin enough to upgrade.
 Could be used in a 'script_score' to adjust number of active installs or
 in a log1p field_value_factor scoring.
 - active_installs : obviously in log1p field_value_factor scoring.
 - support_resolution_yes, support_resolution_no, support_resolution_mu :
 (I'm not sure about definition of this compared to support threads)
 - support_resolution_percentage : support_resolution_yes /
 (support_resolution_yes + support_resolution_no + support_resolution_mu)
 : log1p field_value_factor scoring
 - support_threads : log1p field_value_factor scoring
 - support_threads_resolved : log1p field_value_factor scoring
 - support_threads_percentage : support_threads_resolved / support_threads
 : log1p field_value_factor scoring
 - contributors_active_installs: the sum of the number of active installs
 across all the plugins and all the authors of this plugin. Should be a
 great signal. Example: https://wordpress.org/plugins/slug-control/ by Mark
 Jaquith has 30 active installs. Because of who wrote it I have no doubt it
 is a great plugin for that use case, but 99% of users don't know that.
 Total active installs by the contributors should be a good proxy for this.
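
 As a rough sketch of what "multiply in a log1p field_value_factor" could
 look like, the Python below wraps the multi_match query from above in a
 function_score query. The choice of signal fields and the "missing"
 default are assumptions for illustration, not tuned values, and the
 function names are hypothetical.

```python
import json

SEARCH_FIELDS = [
    "content_en", "title_en^2", "excerpt_en", "upgrade_notice_en^0.5",
    "slug_ngram", "header_author", "contributors",
]

# Assumption: an arbitrary subset of the signals discussed above.
SIGNAL_FIELDS = (
    "active_installs",
    "contributors_active_installs",
    "support_threads_resolved",
    "number_of_translations",
)


def text_query(terms):
    """Bool query: cross_fields match required, phrase match as a bonus."""
    def multi_match(match_type):
        return {"multi_match": {
            "type": match_type,
            "fields": SEARCH_FIELDS,
            "query": terms,
        }}
    return {"bool": {
        "must": [multi_match("cross_fields")],
        "should": [multi_match("phrase")],
    }}


def scored_query(terms, signals=SIGNAL_FIELDS):
    """Multiply the text score by one log1p factor per signal field."""
    functions = [
        # "missing": 1 is an assumed default for documents without the field.
        {"field_value_factor": {"field": name, "modifier": "log1p", "missing": 1}}
        for name in signals
    ]
    return {"query": {"function_score": {
        "query": text_query(terms),
        "functions": functions,
        "score_mode": "multiply",
        "boost_mode": "multiply",
    }}}


if __name__ == "__main__":
    print(json.dumps(scored_query("related posts"), indent=2))
```

 One caveat with multiplying: log1p of a field value of 0 is 0, so a plugin
 with zero resolved threads or zero translations would score 0 overall. The
 log2p modifier, or summing the factors instead of multiplying them, avoids
 that.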


 Open issues (I think):
 - Should we bulk reindex this index daily and query GlotPress to get the
 latest translations? Can someone provide me with the PHP code for doing
 that?
 - We should probably have a complete whitelist of supported langs and have
 a field for each with the appropriately configured lang analyzer? Is there
 a full list?
 - What should the full query be when doing instant search?
 - What should the full query be when doing non-instant search?
 - It might be nice to add some meta fields with actual user names of the
 contributors to search against.
 - Related to @tellyworth's point about generic terms: we should not treat
 an exact name match as highly as we have been in the past. I think just
 boosting the title a bit will do that, but it needs some experimentation.
 I'm biased with this example, but let's look at
 [https://wordpress.org/plugins/search.php?q=related+posts]. The top
 recommendation has 40k installs. YARPP at #2 has 300k. Jetpack is not on
 the first couple of pages despite having far more installs. The currently
 deployed test query though is also far too heavily weighted towards number
 of installs (see [https://cloudup.com/cZK8ZPNUE5p]). Those are pretty
 terrible results. We need to find some middle ground and balance for these
 weightings. "SEO" is another case that is probably worth testing with.

--
Ticket URL: <https://meta.trac.wordpress.org/ticket/1692#comment:4>
Making WordPress.org <https://meta.trac.wordpress.org/>