[wp-trac] [WordPress Trac] #20383: Strip trailing punctuation with canonical URLs

WordPress Trac noreply at wordpress.org
Sat Mar 11 16:37:03 UTC 2017


#20383: Strip trailing punctuation with canonical URLs
---------------------------------------------+-----------------------------
 Reporter:  nacin                            |       Owner:  SergeyBiryukov
     Type:  defect (bug)                     |      Status:  reopened
 Priority:  normal                           |   Milestone:  4.8
Component:  Canonical                        |     Version:
 Severity:  normal                           |  Resolution:
 Keywords:  has-patch has-unit-tests commit  |     Focuses:
---------------------------------------------+-----------------------------
Changes (by ocean90):

 * status:  closed => reopened
 * resolution:  fixed =>


Comment:

 I can't get the curly quote tests from [40256] passing:

 {{{
 1) Tests_Canonical_NoRewrite::test with data set #25 ('/?p=358“',
 array('/?p=358', array('358')), 20383)
 Ticket #20383
 Failed asserting that two strings are equal.
 --- Expected
 +++ Actual
 @@ @@
 -'/?p=358'
 +'/?p=358�__'
 }}}

 I'm using OS X and have already tried several PHP versions: the
 pre-installed PHP 5.6.28, and 5.6.29, 7.0.15, 7.1.2 and 7.2.0-dev via
 Homebrew.

 It seems like `parse_url()` is (sometimes?) not multibyte aware:
 * https://bugs.php.net/bug.php?id=52923
 * http://bluebones.net/2013/04/parse_url-is-not-utf-8-safe/
 * https://github.com/fguillot/picoFeed/issues/167
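
 One possible workaround, sketched under assumptions: percent-encode every
 non-ASCII byte before calling `parse_url()`, so that it only ever sees
 ASCII. The helper name `parse_url_ascii_safe()` is hypothetical (it is not
 a PHP or WordPress function), and the parts come back percent-encoded
 rather than decoded.

```php
<?php
/**
 * Hypothetical helper, for illustration only: make a URL ASCII-only
 * before parsing it, so byte-oriented handling inside parse_url()
 * cannot mangle multibyte characters.
 */
function parse_url_ascii_safe( $url ) {
	// No /u flag: the pattern matches single bytes >= 0x80,
	// and rawurlencode() turns each one into an uppercase %XX escape.
	$ascii = preg_replace_callback(
		'/[\x80-\xFF]/',
		function ( $match ) {
			return rawurlencode( $match[0] );
		},
		$url
	);

	// Components stay percent-encoded; decoding is left to the caller.
	return parse_url( $ascii );
}

// "\xE2\x80\x9D" is the UTF-8 encoding of the right curly quote.
$parts = parse_url_ascii_safe( "http://example.org/?p=358\xE2\x80\x9D" );
var_dump( $parts['query'] ); // string(14) "p=358%E2%80%9D"
```

 Leaving the result encoded avoids accidentally decoding escapes that were
 already present in the input, such as `%20`.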

 Using PHP's built-in server I get the following output for
 `http://localhost:8000/test.php?p=358”`:

 {{{
 $_GET
 array(1) {
   ["p"]=>
   string(6) "358”"
 }

 $_SERVER['REQUEST_URI']
 string(24) "/test.php?p=358%E2%80%9D"

 parse_url( $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'] )
 array(4) {
   ["host"]=>
   string(9) "localhost"
   ["port"]=>
   int(8000)
   ["path"]=>
   string(9) "/test.php"
   ["query"]=>
   string(14) "p=358%E2%80%9D"
 }

 // What the tests are doing:
 parse_url( "http://example.org/?p=358”" )
 array(4) {
   ["scheme"]=>
   string(4) "http"
   ["host"]=>
   string(11) "example.org"
   ["path"]=>
   string(1) "/"
   ["query"]=>
   string(8) "p=358�__"
 }

 // Example from https://github.com/fguillot/picoFeed/issues/167
 parse_url( "https://ru.wikipedia.org/wiki/Преступление_и_наказание" )
 array(3) {
   ["scheme"]=>
   string(5) "https"
   ["host"]=>
   string(16) "ru.wikipedia.org"
   ["path"]=>
   string(52) "/wiki/�_�_е�_�_�_пление_и_наказание"
 }
 }}}

 `$_SERVER['REQUEST_URI']` contains the percent-encoded URI, which is also
 what `redirect_canonical()` uses. The encoding is performed client-side,
 for example by your browser or by `wget`. `curl` sends the raw URL, which
 produced an "Invalid request (Malformed HTTP request)" error for me.
 I'm wondering if we really have to test the decoded strings here.
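
 To illustrate the difference between the two forms, here is a small
 sketch that rebuilds the encoded request URI from the decoded value. The
 variable names are mine; only `rawurlencode()` and `rawurldecode()` are
 real PHP functions. The byte lengths match the `string(6)` and
 `string(24)` values in the dump above.

```php
<?php
// "358" followed by the right curly quote, written as explicit UTF-8 bytes.
$raw = "358\xE2\x80\x9D";

// What a browser or wget would put on the wire for that value.
$encoded = rawurlencode( $raw );

// The form the server exposes in $_SERVER['REQUEST_URI'].
$request_uri = '/test.php?p=' . $encoded;

var_dump( strlen( $raw ) );                    // int(6)
var_dump( $request_uri );                      // string(24) "/test.php?p=358%E2%80%9D"
var_dump( rawurldecode( $encoded ) === $raw ); // bool(true)
```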

--
Ticket URL: <https://core.trac.wordpress.org/ticket/20383#comment:10>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

