[wp-trac] [WordPress Trac] #44296: Enable double-width space works as a separator in search query
WordPress Trac
noreply at wordpress.org
Wed Jul 11 14:21:51 UTC 2018
#44296: Enable double-width space works as a separator in search query
--------------------------------------+-----------------------------
Reporter: ryotsun | Owner: SergeyBiryukov
Type: defect (bug) | Status: reviewing
Priority: normal | Milestone: 5.0
Component: Query | Version: trunk
Severity: normal | Resolution:
Keywords: has-patch has-unit-tests | Focuses:
--------------------------------------+-----------------------------
Comment (by birgire):
The [attachment:"44296_2.patch"] replaces the double-width space for all
searches, but what about limiting it to the case where {{{sentence}}} is
{{{false}}} (the default case)?
Then I think it would be nice to have unit tests for these cases:
{{{
1) $query = new WP_Query( array( 's' => $terms ) );
2) $query = new WP_Query( array( 's' => $terms, 'sentence' => true ) );
3) $query = new WP_Query( array( 's' => $terms, 'exact' => true ) );
4) $query = new WP_Query( array( 's' => $terms, 'exact' => true,
'sentence' => true ) );
}}}
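As a rough illustration of the suggestion above, here is a minimal sketch that replaces the ideographic space only when {{{sentence}}} is not set, and runs the four argument combinations through it. The helper name {{{sketch_normalize_search_spaces()}}} is hypothetical and is not part of the patch or of core; the assumption is that {{{exact}}} should not influence the replacement.

```php
<?php
// Hypothetical helper (not from the patch): swap the ideographic space
// (U+3000, UTF-8 bytes E3 80 80) for an ASCII space, but only when the
// 'sentence' argument is empty, i.e. the default case.
function sketch_normalize_search_spaces( array $q ) {
	if ( isset( $q['s'] ) && empty( $q['sentence'] ) ) {
		$q['s'] = str_replace( "\xE3\x80\x80", ' ', $q['s'] );
	}
	return $q;
}

$terms = "foo\xE3\x80\x80bar";
$cases = array(
	array( 's' => $terms ),
	array( 's' => $terms, 'sentence' => true ),
	array( 's' => $terms, 'exact' => true ),
	array( 's' => $terms, 'exact' => true, 'sentence' => true ),
);

foreach ( $cases as $i => $args ) {
	$out = sketch_normalize_search_spaces( $args );
	// Report whether the double-width space was turned into a separator.
	printf( "%d) %s\n", $i + 1, 'foo bar' === $out['s'] ? 'normalized' : 'kept as-is' );
}
```

Under that assumption, cases 1 and 3 are normalized and cases 2 and 4 keep the original string.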
I wonder if some of the various whitespace characters:
https://en.wikipedia.org/wiki/Whitespace_character
including the ideographic space (\u3000), should be handled?
I played with the {{{u}}} modifier in PCRE and skimmed through the great
info [https://www.regular-expressions.info/unicode.html here].
Here's a test example, inspired by
[https://stackoverflow.com/questions/20105567/php-convert-unicode-spaces-to-ascii-spaces the question here]:
{{{
$json = json_decode( '{ "string" :
"This\u3000is\u2001a\u2002search\u2003for\u2004þ ö\tá\n" }' );
echo preg_replace( '/[\p{Zs}]/u', '_', $json->string );
echo preg_replace( '/[\pZ]/u', '_', $json->string );
echo preg_replace( '/[\pZ\pC]/u', '_', $json->string );
}}}
with output:
{{{
This_is_a_search_for_þ_ö á
This_is_a_search_for_þ_ö á
This_is_a_search_for_þ_ö_á_
}}}
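The first two output lines are identical because every separator in the sample string is in {{{Zs}}}; the tab and the newline are control characters ({{{Cc}}}), so only the {{{\pC}}} variant touches them. A small sketch that makes the class membership explicit (the helper name {{{sketch_space_classes()}}} is hypothetical, and it assumes a PCRE build with UTF-8 and Unicode property support):

```php
<?php
// Hypothetical helper: report which Unicode classes a single character matches.
function sketch_space_classes( $char ) {
	return array(
		'Zs' => 1 === preg_match( '/\p{Zs}/u', $char ),
		'Z'  => 1 === preg_match( '/\pZ/u', $char ),
		'C'  => 1 === preg_match( '/\pC/u', $char ),
	);
}

// Characters are written as UTF-8 byte escapes so the file encoding does not matter.
$samples = array(
	'U+0020 space'             => ' ',
	'U+00A0 no-break space'    => "\xC2\xA0",
	'U+3000 ideographic space' => "\xE3\x80\x80",
	'U+2028 line separator'    => "\xE2\x80\xA8",
	'U+0009 tab'               => "\t",
	'U+000A line feed'         => "\n",
);

foreach ( $samples as $label => $char ) {
	$c = sketch_space_classes( $char );
	printf( "%-26s Zs=%d Z=%d C=%d\n", $label, $c['Zs'], $c['Z'], $c['C'] );
}
```

Of these samples, only U+2028 matches {{{\pZ}}} without matching {{{\p{Zs}}}}, and the tab and line feed match only {{{\pC}}}.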
I also played with e.g.:
{{{
if ( preg_match_all( '/".*?("|$)|((?<=[\p{Zs}",+])|^)[^\p{Zs}",+]+/u',
$q['s'], $matches ) ) {
}}}
instead of:
{{{
if ( preg_match_all( '/".*?("|$)|((?<=[\t ",+])|^)[^\t ",+]+/', $q['s'],
$matches ) ) {
}}}
in {{{WP_Query::parse_search()}}}.
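To see the difference between the two patterns outside of core, here is a standalone sketch that runs both against the same search string. The function name {{{sketch_split_search()}}} and the sample string are made up for illustration; only the two patterns come from the snippets above.

```php
<?php
// Hypothetical wrapper: return the raw tokens a pattern extracts from a search string.
function sketch_split_search( $pattern, $search ) {
	if ( preg_match_all( $pattern, $search, $matches ) ) {
		return $matches[0];
	}
	return array();
}

// Current pattern: only tab and ASCII space separate terms.
$current = '/".*?("|$)|((?<=[\t ",+])|^)[^\t ",+]+/';
// Experimental pattern: any Unicode space separator (Zs) separates terms.
$unicode = '/".*?("|$)|((?<=[\p{Zs}",+])|^)[^\p{Zs}",+]+/u';

// "foo", an ideographic space (U+3000), "bar", an ASCII space, then a quoted phrase.
$search = "foo\xE3\x80\x80bar \"exact phrase\"";

print_r( sketch_split_search( $current, $search ) );
print_r( sketch_split_search( $unicode, $search ) );
```

With the current pattern, {{{foo}}} and {{{bar}}} stay glued together as one token; with the {{{\p{Zs}}}} variant they come out as two tokens. The quoted phrase is one token in both cases.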
I'm not sure how widely the PCRE {{{u}}} modifier is supported, though,
or what the security implications of using it here would be.
Ticket #24661 introduces the {{{_wp_can_use_pcre_u()}}} function.
It's interesting to see how core uses the {{{u}}} modifier, both with a
support check and also without such a check, e.g. in
{{{sanitize_file_name()}}},
{{{WP_Text_Diff_Renderer_inline::_splitOnWords()}}} and
{{{wp_maybe_decline_date()}}}.
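One way the "with a support check" approach could look for the search pattern is sketched below. Both function names are hypothetical, and the probe is only an assumption about what a check might test; it is not the body of {{{_wp_can_use_pcre_u()}}}.

```php
<?php
// Hypothetical support check: probe once whether this PCRE build accepts
// the u modifier together with a Unicode property class.
function sketch_can_use_pcre_zs() {
	static $supported = null;
	if ( null === $supported ) {
		// preg_match() returns false (with a warning) if the pattern cannot be compiled.
		$supported = ( 1 === @preg_match( '/\p{Zs}/u', ' ' ) );
	}
	return $supported;
}

// Hypothetical selector: fall back to the ASCII-only pattern when the probe fails.
function sketch_search_pattern() {
	if ( sketch_can_use_pcre_zs() ) {
		return '/".*?("|$)|((?<=[\p{Zs}",+])|^)[^\p{Zs}",+]+/u';
	}
	return '/".*?("|$)|((?<=[\t ",+])|^)[^\t ",+]+/';
}

preg_match_all( sketch_search_pattern(), 'one two', $matches );
print_r( $matches[0] );
```

Note that with the {{{u}}} modifier, the preg functions fail if the subject is not valid UTF-8, so a caller would probably also want to handle that failure case.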
--
Ticket URL: <https://core.trac.wordpress.org/ticket/44296#comment:9>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform