[wp-trac] [WordPress Trac] #41304: Bad protocol sensitization in KSES for URLs NOT RFC 3986 compliant

WordPress Trac noreply at wordpress.org
Thu Jul 13 10:41:31 UTC 2017


#41304: Bad protocol sensitization in KSES for URLs NOT RFC 3986 compliant
--------------------------+-----------------------------
 Reporter:  bogdanpreda   |      Owner:
     Type:  defect (bug)  |     Status:  new
 Priority:  normal        |  Milestone:  Awaiting Review
Component:  General       |    Version:  4.8
 Severity:  normal        |   Keywords:
  Focuses:                |
--------------------------+-----------------------------
 This affects URLs that are passed through the kses sanitizer.
 As specified in RFC 3986, Section 3.3:

   The path component contains data, usually organized in hierarchical
 form, that, along with data in the non-hierarchical query component
 (Section 3.4), serves to identify a resource within the scope of the URI's
 scheme and naming authority (if any). The path is terminated by the first
 question mark ("?") or number sign ("#") character, or by the end of the
 URI.

   If a URI contains an authority component, then the path component must
 either be empty or begin with a slash ("/") character. If a URI does not
 contain an authority component, then the path cannot begin with two slash
 characters ("//"). In addition, a URI reference (Section 4.1) may be a
 relative-path reference, in which case the first path segment cannot
 contain a colon (":") character. The ABNF requires five separate rules to
 disambiguate these cases, only one of which will match the path substring
 within a given URI reference. We use the generic term "path component" to
 describe the URI substring matched by the parser to one of these rules.

 So a colon (':') is allowed inside URLs. When the URL is split like
 this:
 {{{#!php
 <?php
 function wp_kses_bad_protocol_once($string, $allowed_protocols, $count = 1 ) {
         $string2 = preg_split( '/:|&#0*58;|&#x0*3a;/i', $string, 2 );
 ...
 }}}

 for URLs that do not specify a scheme but contain a colon (':'), the
 split breaks and only the part of the URL after the colon is returned.
 E.g.:

 //t0.gstatic.com/images?q=tbn:ANd9GcSxT2q6fV-59s5hq5a03fpgsFYzVtL014iARzGRG7S_3CUjYpIGNlQx0ruGtVl5KCAEOxAtb_ZQ

 will return:
 ANd9GcSxT2q6fV-59s5hq5a03fpgsFYzVtL014iARzGRG7S_3CUjYpIGNlQx0ruGtVl5KCAEOxAtb_ZQ
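
 The split itself can be reproduced outside WordPress. The following is a
 minimal sketch that assumes only PHP's built-in preg_split() and the
 pattern quoted above; the variable names are illustrative and the rest of
 the kses code path is not reproduced here.

```php
<?php
// Sketch only: reproduces the split on the first colon (or its numeric
// character references), using the pattern quoted in this report.
// $url is the example URL from this report.
$url = '//t0.gstatic.com/images?q=tbn:ANd9GcSxT2q6fV-59s5hq5a03fpgsFYzVtL014iARzGRG7S_3CUjYpIGNlQx0ruGtVl5KCAEOxAtb_ZQ';

// Limit of 2: everything before the first match, and everything after it.
$parts = preg_split( '/:|&#0*58;|&#x0*3a;/i', $url, 2 );

// $parts[0] ends up being treated as the protocol candidate,
// $parts[1] as the remainder of the URL.
echo $parts[0], "\n";
echo $parts[1], "\n";
```

 The colon inside the query string is the first match, so the host, path
 and start of the query land in the part that is treated as a protocol.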

 Also, a “network-path reference” should be supported; the current code
 assumes that a scheme is always present.

 Changing the split to:
 {{{#!php
 <?php
 ...
 $string2 = preg_split( '/(:\/\/)|&#0*58;|&#x0*3a;/i', $string, 2 );
 if ( isset($string2[1]) && ! preg_match('%/\?%', $string2[0]) ) {
   ...
   $string = $protocol . '//' . $string;
 }
 ...
 }}}

 fixes this issue and is more compliant with RFC 3986 without breaking
 sanitization.
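
 To compare how the two patterns split a URL, here is a minimal sketch. It
 assumes only preg_split(); split_with() is a hypothetical helper, not part
 of WordPress, the network-path URL is a shortened form of the example
 above, and http://example.com/a:b is a made-up URL. It illustrates the
 splitting step only and says nothing about how the protocol is validated
 afterwards.

```php
<?php
// Hypothetical helper for illustration only; not part of WordPress.
function split_with( $pattern, $url ) {
    return preg_split( $pattern, $url, 2 );
}

$current  = '/:|&#0*58;|&#x0*3a;/i';        // pattern currently used, as quoted in this report
$proposed = '/(:\/\/)|&#0*58;|&#x0*3a;/i';  // pattern proposed in this report

$network_path = '//t0.gstatic.com/images?q=tbn:ANd9GcSxT2q6fV'; // shortened example
$with_scheme  = 'http://example.com/a:b';                        // made-up URL

// Network-path reference: the current pattern splits at the colon in the
// query string; the proposed pattern finds no "://" and leaves it whole.
$a = split_with( $current, $network_path );
$b = split_with( $proposed, $network_path );

// URL with a scheme: the "://" delimiter is consumed by the split (the
// capture group is not returned without PREG_SPLIT_DELIM_CAPTURE), which
// is why the proposed code re-adds '//' after the protocol.
$c = split_with( $proposed, $with_scheme );

var_dump( count( $a ), count( $b ), $c );
```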

--
Ticket URL: <https://core.trac.wordpress.org/ticket/41304>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

