[wp-trac] [WordPress Trac] #20383: Strip trailing punctuation with canonical URLs
WordPress Trac
noreply at wordpress.org
Sun Mar 12 14:26:53 UTC 2017
#20383: Strip trailing punctuation with canonical URLs
---------------------------------------------+-----------------------------
Reporter: nacin | Owner: SergeyBiryukov
Type: defect (bug) | Status: reopened
Priority: normal | Milestone: 4.8
Component: Canonical | Version:
Severity: normal | Resolution:
Keywords: has-patch has-unit-tests commit | Focuses:
---------------------------------------------+-----------------------------
Comment (by gitlost):
> It seems like parse_url() is (sometimes?) not multibyte aware
Another one of those incredibly tedious locale gotchas, this time
apparently affecting only non-Linux systems.
If you run this on OS X or Windows 10:
{{{#!php
<?php
error_log( "default LC_CTYPE=" . setlocale( LC_CTYPE, '0' ) );
error_log( "ctype_cntrl=" . ( ctype_cntrl( "\x90" ) ? 'true' : 'false' ) );
$before = parse_url( 'https://ru.wikipedia.org/wiki/Преступление_и_наказание', PHP_URL_PATH );
error_log( "before=$before" );
error_log( "before === " . ( '/wiki/Преступление_и_наказание' === $before ? 'true' : 'false' ) );
error_log( "setting LC_CTYPE=" . setlocale( LC_CTYPE, 'C' ) );
error_log( "ctype_cntrl=" . ( ctype_cntrl( "\x90" ) ? 'true' : 'false' ) );
$after = parse_url( 'https://ru.wikipedia.org/wiki/Преступление_и_наказание', PHP_URL_PATH );
error_log( "after=$after" );
error_log( "after === " . ( '/wiki/Преступление_и_наказание' === $after ? 'true' : 'false' ) );
}}}
then, if your default locale is "UTF-8ish", `$before` will be corrupted
while `$after`, with `LC_CTYPE` set to `'C'`, won't.
It's due to the libc function `iscntrl()` returning true for certain
non-ASCII characters like `\x90` on these systems depending on locale,
which makes PHP replace them with underscores
([https://github.com/php/php-src/blob/c8aa6f3a9a3d2c114d0c5e0c9fdd0a465dbb54a5/ext/standard/url.c#L75 ext/standard/url.c:75]).
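To make the mechanism concrete, here is a rough userland imitation of that replacement step. It is only a sketch: the function name `replace_controlchars_sketch()` is made up for illustration, and it uses `ctype_cntrl()` as a stand-in for the libc `iscntrl()` call in the C source, so it assumes the ctype extension is loaded.

```php
<?php
// Hypothetical imitation of the control-character replacement that
// parse_url() applies; NOT the actual PHP source, just the same idea.
function replace_controlchars_sketch( $str ) {
	$out = '';
	$len = strlen( $str );
	for ( $i = 0; $i < $len; $i++ ) {
		$byte = $str[ $i ];
		// ctype_cntrl() follows the current LC_CTYPE locale, so the
		// result for bytes above 0x7F can differ between systems.
		$out .= ctype_cntrl( $byte ) ? '_' : $byte;
	}
	return $out;
}

// In the 'C' locale only 0x00-0x1F and 0x7F count as control characters.
setlocale( LC_CTYPE, 'C' );
var_dump( replace_controlchars_sketch( "a\x01b" ) === 'a_b' );
var_dump( replace_controlchars_sketch( "a\x90b" ) === "a\x90b" );
```

Running the second check under a UTF-8 default locale on an affected system is where the result would be expected to differ, since `\x90` is a continuation byte in many multibyte UTF-8 sequences.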
To get around it, one could use the Joomla function, or maybe just wrap
`parse_url()` in `setlocale()` calls that switch `LC_CTYPE` to `'C'`
beforehand and restore it afterwards, e.g.:
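{{{#!php
<?php
function parse_url_utf8( $url, $component = -1 ) {
	// Remember the current LC_CTYPE locale ('0' queries without changing it).
	$prev_ctype_locale = setlocale( LC_CTYPE, '0' );
	// In the 'C' locale no byte above 0x7F is classed as a control character.
	setlocale( LC_CTYPE, 'C' );
	$ret = parse_url( $url, $component );
	// Restore the previous locale.
	setlocale( LC_CTYPE, $prev_ctype_locale );
	return $ret;
}
}}}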
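For what it's worth, a quick way to check the wrapper idea is sketched below. The name `parse_url_utf8_sketch()` is hypothetical, and the `try`/`finally` (which needs PHP 5.5+) plus the `false` guard are my own additions, not part of the suggestion above; they just make sure the locale is put back however the function exits.

```php
<?php
// Hypothetical variant of the wrapper, for illustration only.
function parse_url_utf8_sketch( $url, $component = -1 ) {
	// Query the current LC_CTYPE locale without changing it.
	$prev_ctype_locale = setlocale( LC_CTYPE, '0' );
	setlocale( LC_CTYPE, 'C' );
	try {
		return parse_url( $url, $component );
	} finally {
		// Only restore if the query above actually returned a locale name.
		if ( false !== $prev_ctype_locale ) {
			setlocale( LC_CTYPE, $prev_ctype_locale );
		}
	}
}

$path = parse_url_utf8_sketch(
	'https://ru.wikipedia.org/wiki/Преступление_и_наказание',
	PHP_URL_PATH
);
var_dump( '/wiki/Преступление_и_наказание' === $path );
```

Note that `setlocale()` changes the locale for the whole process, so on threaded server setups the temporary switch could be visible to other threads.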
--
Ticket URL: <https://core.trac.wordpress.org/ticket/20383#comment:11>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform