[wp-trac] [WordPress Trac] #20383: Strip trailing punctuation with canonical URLs

WordPress Trac noreply at wordpress.org
Sun Mar 12 14:26:53 UTC 2017


#20383: Strip trailing punctuation with canonical URLs
---------------------------------------------+-----------------------------
 Reporter:  nacin                            |       Owner:  SergeyBiryukov
     Type:  defect (bug)                     |      Status:  reopened
 Priority:  normal                           |   Milestone:  4.8
Component:  Canonical                        |     Version:
 Severity:  normal                           |  Resolution:
 Keywords:  has-patch has-unit-tests commit  |     Focuses:
---------------------------------------------+-----------------------------

Comment (by gitlost):

 >  It seems like parse_url() is (sometimes?) not multibyte aware

 Another one of those incredibly tedious locale gotchas, this time only
 affecting non-Linux systems it seems.

 If you run this on OSX or Windows 10:

 {{{#!php
 <?php
 error_log( "default LC_CTYPE=" . setlocale( LC_CTYPE, '0' ) );
 error_log( "ctype_cntrl=" . ( ctype_cntrl( "\x90" ) ? 'true' : 'false' ) );

 $before = parse_url( 'https://ru.wikipedia.org/wiki/Преступление_и_наказание', PHP_URL_PATH );
 error_log( "before=$before" );
 error_log( "before === " . ( '/wiki/Преступление_и_наказание' === $before ? 'true' : 'false' ) );

 error_log( "setting LC_CTYPE=" . setlocale( LC_CTYPE, 'C' ) );
 error_log( "ctype_cntrl=" . ( ctype_cntrl( "\x90" ) ? 'true' : 'false' ) );

 $after = parse_url( 'https://ru.wikipedia.org/wiki/Преступление_и_наказание', PHP_URL_PATH );
 error_log( "after=$after" );
 error_log( "after === " . ( '/wiki/Преступление_и_наказание' === $after ? 'true' : 'false' ) );
 }}}

 then, if your default locale is "UTF-8ish", `$before` will be corrupted,
 while `$after`, parsed with `LC_CTYPE` set to `'C'`, won't be.

 It's due to the libc function `iscntrl()` returning true for certain
 non-ASCII characters like `\x90` on these systems depending on locale,
 which makes PHP replace them with underscores
 ([https://github.com/php/php-src/blob/c8aa6f3a9a3d2c114d0c5e0c9fdd0a465dbb54a5/ext/standard/url.c#L75 ext/standard/url.c:75]).
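
 To make the mechanism concrete, here is a rough simulation of that
 replacement step in plain PHP. The helper name
 `simulate_cntrl_replacement()` is hypothetical, and the byte range it
 treats as "control" (`0x80`-`0x9F` in addition to the ASCII control
 bytes) is only an assumption about what `iscntrl()` reports in the
 affected locales; it is not taken from PHP's source.

```php
<?php
/**
 * Hypothetical helper: mimic a byte-by-byte "replace control characters
 * with underscores" pass.
 *
 * ASSUMPTION: the locale's iscntrl() reports bytes 0x80-0x9F as control
 * characters, on top of the usual ASCII ones (0x00-0x1F and 0x7F).
 */
function simulate_cntrl_replacement( $str ) {
	// No /u modifier, so the pattern matches individual bytes,
	// not UTF-8 code points.
	return preg_replace( '/[\x00-\x1F\x7F-\x9F]/', '_', $str );
}

// "Пр" in UTF-8 is D0 9F D1 80: both trailing bytes fall in 0x80-0x9F.
$bytes = "\xD0\x9F\xD1\x80";

error_log( 'original=' . bin2hex( $bytes ) );
error_log( 'replaced=' . bin2hex( simulate_cntrl_replacement( $bytes ) ) );
```

 Under that assumption only the second byte of some Cyrillic letters is
 overwritten, which leaves invalid UTF-8 behind rather than a cleanly
 truncated string.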

 To get around it, one could use the Joomla function, or maybe just wrap
 `parse_url()` in `setlocale()` calls before and after, e.g.:

 {{{#!php
 <?php
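 function parse_url_utf8( $url, $component = -1 ) {
         // '0' returns the current LC_CTYPE setting without changing it.
         $prev_ctype_locale = setlocale( LC_CTYPE, '0' );
         // In the 'C' locale, bytes >= 0x80 are not classed as control characters.
         setlocale( LC_CTYPE, 'C' );
         $ret = parse_url( $url, $component );
         // Restore the caller's locale.
         setlocale( LC_CTYPE, $prev_ctype_locale );
         return $ret;
 }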
 }}}
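
 A slightly more defensive variant might look like the sketch below. The
 name `parse_url_c_locale()` is hypothetical, and it assumes PHP 5.5+ for
 `try`/`finally`. The only differences from the wrapper above are that
 the previous locale is restored even if an exception is thrown, and that
 nothing is restored if the initial `setlocale()` query failed.

```php
<?php
/**
 * Hypothetical variant: parse a URL with LC_CTYPE temporarily forced to
 * 'C', restoring the previous setting afterwards in all cases.
 */
function parse_url_c_locale( $url, $component = -1 ) {
	// '0' queries the current LC_CTYPE setting without changing it.
	$prev = setlocale( LC_CTYPE, '0' );
	setlocale( LC_CTYPE, 'C' );

	try {
		return parse_url( $url, $component );
	} finally {
		// Only restore if the query above actually returned a locale.
		if ( false !== $prev ) {
			setlocale( LC_CTYPE, $prev );
		}
	}
}

// Usage: same check as in the test script earlier in this comment.
$url  = 'https://ru.wikipedia.org/wiki/Преступление_и_наказание';
$path = parse_url_c_locale( $url, PHP_URL_PATH );
error_log( 'path === ' . ( '/wiki/Преступление_и_наказание' === $path ? 'true' : 'false' ) );
```

 Note that `setlocale()` is process-wide, so either version could
 briefly affect other threads on threaded SAPIs.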

--
Ticket URL: <https://core.trac.wordpress.org/ticket/20383#comment:11>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

