[wp-trac] [WordPress Trac] #44296: Enable double-width space works as a separator in search query

WordPress Trac noreply at wordpress.org
Wed Jul 11 14:21:51 UTC 2018


#44296: Enable double-width space works as a separator in search query
--------------------------------------+-----------------------------
 Reporter:  ryotsun                   |       Owner:  SergeyBiryukov
     Type:  defect (bug)              |      Status:  reviewing
 Priority:  normal                    |   Milestone:  5.0
Component:  Query                     |     Version:  trunk
 Severity:  normal                    |  Resolution:
 Keywords:  has-patch has-unit-tests  |     Focuses:
--------------------------------------+-----------------------------

Comment (by birgire):

 The [attachment:"44296_2.patch"] replaces the double-width space for all
 searches, but what about limiting it to the case where {{{sentence}}} is
 {{{false}}} (the default case)?

 Then I think it would be nice to have unit tests for these cases:

 {{{
 1) $query = new WP_Query( array( 's' => $terms ) );
 2) $query = new WP_Query( array( 's' => $terms, 'sentence' => true ) );
 3) $query = new WP_Query( array( 's' => $terms, 'exact' => true ) );
 4) $query = new WP_Query( array( 's' => $terms, 'exact' => true, 'sentence' => true ) );

 }}}
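
 As a rough illustration of the suggestion above, here is a minimal, stand-alone sketch that only normalizes the ideographic space when {{{sentence}}} is not set. The helper name {{{demo_prepare_search_args()}}} is hypothetical and is not part of the patch or of core; the loop simply walks the four argument combinations listed above.

```php
<?php
// Hypothetical helper for illustration only; not WordPress core code.
// Replaces the ideographic space (U+3000) with an ASCII space, but only
// when 'sentence' is empty, i.e. the default case.
function demo_prepare_search_args( array $args ) {
	if ( isset( $args['s'] ) && empty( $args['sentence'] ) ) {
		// "\xE3\x80\x80" is the UTF-8 byte sequence for U+3000.
		$args['s'] = str_replace( "\xE3\x80\x80", ' ', $args['s'] );
	}
	return $args;
}

$terms = "foo\xE3\x80\x80bar";
$cases = array(
	array( 's' => $terms ),
	array( 's' => $terms, 'sentence' => true ),
	array( 's' => $terms, 'exact' => true ),
	array( 's' => $terms, 'exact' => true, 'sentence' => true ),
);

foreach ( $cases as $i => $args ) {
	$prepared = demo_prepare_search_args( $args );
	$state    = ( $prepared['s'] === $terms ) ? 'unchanged' : 'normalized';
	printf( "%d) %s\n", $i + 1, $state );
}
```

 Under these assumptions, cases 1 and 3 are normalized and cases 2 and 4 keep the original string, since {{{exact}}} does not influence the decision in this sketch.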


 I wonder if some of the various whitespace characters:

 https://en.wikipedia.org/wiki/Whitespace_character

 including the ideographic space (\u3000), should be handled?


 I played with the {{{u}}} modifier in PCRE and skimmed through the great
 info [https://www.regular-expressions.info/unicode.html here].


 Here's a test example, inspired by
 [https://stackoverflow.com/questions/20105567/php-convert-unicode-spaces-to-ascii-spaces the question here]:
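 {{{
 // Sample string mixing several Unicode spaces with a tab and a newline.
 $json = json_decode( '{ "string" :
 "This\u3000is\u2001a\u2002search\u2003for\u2004þ ö\tá\n" }' );

 // Zs: space separators only.
 echo preg_replace( '/[\p{Zs}]/u', '_', $json->string );
 // Z: all separators (Zs, Zl, Zp).
 echo preg_replace( '/[\pZ]/u',    '_', $json->string );
 // Z plus C: separators and the "Other" category, which covers \t and \n.
 echo preg_replace( '/[\pZ\pC]/u', '_', $json->string );
 }}}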

 with output:

 {{{
 This_is_a_search_for_þ_ö        á

 This_is_a_search_for_þ_ö        á

 This_is_a_search_for_þ_ö_á_

 }}}

 I also played with e.g.:

 {{{
 if ( preg_match_all( '/".*?("|$)|((?<=[\p{Zs}",+])|^)[^\p{Zs}",+]+/u',
 $q['s'], $matches ) ) {

 }}}

 instead of:

 {{{
 if ( preg_match_all( '/".*?("|$)|((?<=[\t ",+])|^)[^\t ",+]+/', $q['s'],
 $matches ) ) {

 }}}

 in {{{WP_Query::parse_search()}}}.
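
 To make the difference between the two patterns easier to see outside of WordPress, here is a small stand-alone sketch. The function name {{{demo_split_terms()}}} is hypothetical. Note one deliberate deviation from the pattern I tried above: since the tab character is a control character and not part of {{{\p{Zs}}}}, the sketch keeps {{{\t}}} in the character class so that tabs still act as separators.

```php
<?php
// Hypothetical stand-alone helper for illustration; not WordPress core code.
// Splits a search string into terms with either the byte-oriented pattern
// or a Unicode-aware variant that treats any space separator (Zs) as a break.
function demo_split_terms( $search, $unicode = true ) {
	$pattern = $unicode
		? '/".*?("|$)|((?<=[\t\p{Zs}",+])|^)[^\t\p{Zs}",+]+/u'
		: '/".*?("|$)|((?<=[\t ",+])|^)[^\t ",+]+/';

	// preg_match_all() returns 0 on no match and false on error.
	if ( ! preg_match_all( $pattern, $search, $matches ) ) {
		return array();
	}
	return $matches[0];
}

// "\xE3\x80\x80" is the UTF-8 byte sequence for the ideographic space U+3000.
$search = "foo\xE3\x80\x80bar baz";

// Byte-oriented pattern: U+3000 is not a separator, so two terms come back.
print_r( demo_split_terms( $search, false ) );

// Unicode-aware pattern: three terms, "foo", "bar" and "baz".
print_r( demo_split_terms( $search, true ) );
```

 The {{{u}}} variant returns an empty array in this sketch if the input is not valid UTF-8, because {{{preg_match_all()}}} fails in that case; real code would need to decide how to handle that.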

 I'm not sure, though, how widely the PCRE {{{u}}} modifier is supported,
 or about the security aspects of relying on it.

 Ticket #24661 introduces the {{{_wp_can_use_pcre_u()}}} function.

 It's interesting to see how core uses the {{{u}}} modifier, both with a
 support check and also without such a check, e.g. in
 {{{sanitize_file_name()}}},
 {{{WP_Text_Diff_Renderer_inline::_splitOnWords()}}} and
 {{{wp_maybe_decline_date()}}}.
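
 One way to combine a support check with a fallback could look like the sketch below. Both function names are hypothetical, and the probe is an assumption for illustration; it is not meant to reproduce {{{_wp_can_use_pcre_u()}}}. The fallback only covers the ideographic space, which is the case this ticket is about.

```php
<?php
// Hypothetical helpers for illustration only; not WordPress core code.

// Probes once whether the PCRE build accepts the "u" modifier.
function demo_pcre_u_supported() {
	static $supported = null;
	if ( null === $supported ) {
		// preg_match() returns false (with a warning) if the pattern
		// cannot be compiled, so anything other than 1 means "no".
		$supported = ( 1 === @preg_match( '/^./u', 'a' ) );
	}
	return $supported;
}

// Replaces Unicode space separators with an ASCII space where possible.
function demo_normalize_spaces( $search ) {
	if ( demo_pcre_u_supported() ) {
		$result = preg_replace( '/\p{Zs}/u', ' ', $search );
		// With "u", preg_replace() returns null on invalid UTF-8 input;
		// in that case fall through to the byte-level replacement.
		if ( null !== $result ) {
			return $result;
		}
	}
	// Byte-level fallback: only U+3000 (UTF-8 bytes E3 80 80) is handled.
	return str_replace( "\xE3\x80\x80", ' ', $search );
}

echo demo_normalize_spaces( "foo\xE3\x80\x80bar" ), "\n";
```

 The byte-level fallback keeps the ticket's main case working even where the {{{u}}} modifier is unavailable or the input is not valid UTF-8, at the cost of ignoring the other space separators.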

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/44296#comment:9>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

