[wp-trac] [WordPress Trac] #51159: Let's expand our context specific escaping methods for wp_json_encode().

WordPress Trac noreply at wordpress.org
Thu Aug 27 18:59:19 UTC 2020


#51159: Let's expand our context specific escaping methods for wp_json_encode().
-------------------------+-------------------------------------------------
 Reporter:  whyisjake    |       Owner:  (none)
     Type:  enhancement  |      Status:  new
 Priority:  normal       |   Milestone:  Awaiting Review
Component:  Security     |     Version:
 Severity:  normal       |  Resolution:
 Keywords:               |     Focuses:  javascript, template, coding-
                         |  standards
-------------------------+-------------------------------------------------

New description:

 This document is largely sourced from a document written by @mdawaffe.
 Full credit to him for the research and thoughts put forward here. What
 I'd like to do is move this toward some actionable functions and developer
 best practices moving forward.


 `wp_json_encode()` is a handy helper for turning PHP data into JavaScript,
 and it is widely used in different places to serialize different variables.
 Imagine this hypothetical scenario:

 {{{
 BAD:
 <pre><?php echo json_encode( $_GET ); ?></pre>
 }}}

 `json_encode()` serializes data into a string that can be used as a
 JavaScript literal* (e.g., `null`, `true`, `false`, `1234` (numbers),
 `"strings"`, `[ "arrays" ]`, and `{ "objects": "oh my"}`.)

 JSON serialization, though, has nothing to do with HTML, and so does not
 treat characters that are special in HTML (`<`, `>`, `&`, `'`, `"`) in any
 special way: essentially, the code above is as bad as `echo $_GET['foo']`.

 Securing the above code is as simple as it always is in WordPress. We’re
 echoing data inside an HTML text node, so we use `esc_html()`:

 {{{
 OK:
 <pre><?php echo esc_html( json_encode( $_GET ) ); ?></pre>
 }}}

 Unfortunately, while secure 😀, this code is not actually correct ☹️. For
 historical reasons, `esc_html()` will not touch HTML entities (`&amp;`):

 || Input || `htmlspecialchars()` || `esc_html()` ||
 || `&` || `&amp;` 😀 || `&amp;` 😀 ||
 || `&amp;` || `&amp;amp;` 😀 || `&amp;` ☹️ ||

 In the example above, if there are any HTML entities in `$_GET`, they will
 be echoed verbatim to the page, where the browser decodes them: the
 visitor sees `&` where the data actually contained `&amp;`.
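
 To make the table concrete, here is a minimal sketch in plain PHP (no
 WordPress loaded). It assumes that `esc_html()` behaves like
 `htmlspecialchars()` with `$double_encode` set to `false`, which is a
 simplification of the behavior in the table; the sample string is made up.

```php
<?php
// Assumption: esc_html() is approximated here by htmlspecialchars() with
// $double_encode = false. This is a simplification, not WordPress's code.
$input = 'Tom &amp; Jerry'; // Data that already contains an HTML entity.

// $double_encode = true: the entity's own ampersand is escaped again, so a
// browser shows the original text "Tom &amp; Jerry".
$faithful = htmlspecialchars( $input, ENT_QUOTES, 'UTF-8', true );

// $double_encode = false: the existing entity is left untouched, so a
// browser decodes it and shows "Tom & Jerry" instead of the original text.
$lossy = htmlspecialchars( $input, ENT_QUOTES, 'UTF-8', false );

echo $faithful, "\n"; // Tom &amp;amp; Jerry
echo $lossy, "\n";    // Tom &amp; Jerry
```

 Both results are safe to print; only the first one round-trips the data.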

 === In an HTML Text Node ===

 To faithfully represent the contents of a JSON blob in an HTML text node,
 the following code must be used:

 {{{
 GOOD:
 <pre><?php echo _wp_specialchars(
         wp_json_encode( $value ),
         ENT_NOQUOTES, // Don't need to HTML-escape quotes (output is for a text node).
         'UTF-8',      // json_encode() outputs UTF-8 (really just ASCII), not the blog's charset.
         true          // Do "re-escape" HTML entities: `&amp;` -> `&amp;amp;`
 ); ?></pre>
 }}}

 This code is only appropriate for outputting JSON in HTML text nodes.
 There are several other contexts where we would like to output JSON, and
 each of those different contexts requires different treatment.

 === As an HTML Attribute Node ===

 Though the HTML5 `.dataset` API only accepts string values for `data-*`
 attributes, jQuery will automatically parse `data-*` attribute values that
 are JSON serializations. So, when using jQuery, the following pattern is
 often handy:

 {{{
 BAD:
 <div data-foo='<?php echo json_encode( $foo ); ?>'>
 }}}

 Handy but, as we should know by now, insecure 😀. We need to HTML-escape
 the output.

 Like `esc_html()`, `esc_attr()` also leaves HTML entities untouched, so,
 again, the solution is to “manually” use `_wp_specialchars()`:

 {{{
 GOOD:
 <div data-foo='<?php echo _wp_specialchars(
         wp_json_encode( $foo ),
         ENT_QUOTES, // Must HTML-escape quotes (output is for an attribute node).
         'UTF-8',    // json_encode() outputs UTF-8 (really just ASCII), not the blog's charset.
         true        // Do "re-escape" HTML entities: `&amp;` -> `&amp;amp;`
 ); ?>'>
 }}}

 It’s important to note that this code snippet is suitable for whole HTML
 attributes. It is not appropriate for use on part of an HTML attribute.
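
 As a rough illustration of how the text-node and whole-attribute cases
 differ, here is a sketch in plain PHP. The two function names are
 hypothetical (they are not WordPress APIs), and `htmlspecialchars()` and
 `json_encode()` stand in for `_wp_specialchars()` and `wp_json_encode()`.

```php
<?php
// Hypothetical helpers, not part of WordPress. They mirror the two GOOD
// snippets above using only built-in PHP functions.

function sketch_json_for_text_node( $value ) {
	// ENT_NOQUOTES: quotes are harmless in a text node.
	// Final `true`: existing entities are escaped again (double-encoded).
	return htmlspecialchars( json_encode( $value ), ENT_NOQUOTES, 'UTF-8', true );
}

function sketch_json_for_attribute( $value ) {
	// ENT_QUOTES: both quote styles must be escaped inside an attribute value.
	return htmlspecialchars( json_encode( $value ), ENT_QUOTES, 'UTF-8', true );
}

$data = array(
	'title' => 'A & B',
	'note'  => "it's <b>",
);

echo '<pre>', sketch_json_for_text_node( $data ), "</pre>\n";
echo "<div data-foo='", sketch_json_for_attribute( $data ), "'></div>\n";
```

 The only difference between the two helpers is the quote flag, which is
 why the text-node version must not be reused inside an attribute.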

 In general, when we want to output a JSON blob as part of an HTML
 attribute, it’s because we’re trying to use it as a JavaScript literal:

 {{{
 BAD:
 <a href="#" onclick="doSomething( <?php echo json_encode( $click_data ); ?> )">
 }}}

 We’ve seen that using `json_encode()` by itself in this context is not
 secure, but neither is using the above HTML attribute code
 (`_wp_specialchars( json_encode(), … )`). In the `data-foo` case above,
 we’re outputting JSON. In the `onclick` case, we’re outputting a
 JavaScript literal. `_wp_specialchars()` does enough to the JSON blob to
 make it safe for use as an HTML attribute, but it does not do anything to
 make it safe for use within JavaScript.

 Despite claiming above that `json_encode()` outputs JavaScript literals,
 it’s more complicated than that.

 === In a `<script>` Element ===

 The problem is that, in a `<script>` element, for example, we have to
 consider how the contents are interpreted as HTML first and then as
 JavaScript second.

 The following pattern seems helpful. Use `json_encode()` to output a PHP
 string as a JavaScript string literal:

 {{{
 BAD:
 <script>
 var foo = <?php echo json_encode( (string) $foo ); ?>;
 </script>
 }}}

 There are multiple ways in which this is insecure.

 First, for some pages, HTML entities and the characters they represent are
 one and the same in the `<script>` element context. If `$foo` has HTML
 entities in it, problems will happen. For example, for some pages, the
 following two scripts are the same:

 {{{
 WAT?
 <script>
 var foo = "Hello&quot;; alert(/LOL/); var foo=&quot;LOL";
 </script>
 <script>
 var foo = "Hello"; alert(/LOL/); var foo="LOL";
 </script>
 }}}

 Exactly how HTML entities are interpreted in `<script>` elements depends
 on Content-Type, DOCTYPE, browser, etc. So we need a way to securely use
 JSON in this context that does not depend on HTML-escaping
 (`_wp_specialchars()`).

 So we can’t depend on HTML-escaping to save us, and we need to make sure
 certain strings are never output. Luckily, there are a couple of other
 widely implemented transformations that can help.

 {{{
 GOOD:
 <script>
 var foo = decodeURIComponent( '<?php echo rawurlencode( (string) $foo ); ?>' );
 </script>
 }}}

 If we’re outputting a string, we can URL-encode it in PHP and URL-decode
 it in JavaScript. (We do have to make sure we use the right functions:
 `rawurlencode()` is slightly better than `urlencode()` here, and
 `decodeURIComponent()` is required over JavaScript’s deprecated
 `unescape()`.)

 The useful property of URL-encoding that we’re exploiting is that the
 transformed string is guaranteed not to have any characters in it that are
 regarded as special in the HTML context (`<`, `>`, `&`, `'`, `"`), so
 there’s no way to output `&quot;` or an HTML comment opener (`<!--`).
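
 A quick way to convince ourselves of that property is to check the
 encoded string against the small set of characters `rawurlencode()`
 leaves alone. This is a sketch in plain PHP; the payload is a made-up
 worst case, not something from a real request.

```php
<?php
// A made-up payload containing every character we are worried about.
$payload = "</script><!-- \"quoted\" & 'single'";
$encoded = rawurlencode( $payload );

// rawurlencode() leaves only letters, digits, "-", "_", "." and "~" as-is;
// everything else becomes a %XX sequence.
$is_safe = ( 1 === preg_match( '/^[A-Za-z0-9\-_.~%]*$/', $encoded ) );

echo $encoded, "\n";
var_dump( $is_safe ); // bool(true)

// Why rawurlencode() rather than urlencode(): urlencode() turns a space
// into "+", and JavaScript's decodeURIComponent() does not turn "+" back
// into a space.
echo urlencode( 'a b' ), ' vs ', rawurlencode( 'a b' ), "\n"; // a+b vs a%20b
```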

 For non-scalar data, this can be extended via `json_encode()`:

 {{{
 GOOD:
 <script>
 var foo = JSON.parse( decodeURIComponent( '<?php
         echo rawurlencode( wp_json_encode( $foo ) );
 ?>' ) );
 </script>
 }}}

 Rather than using the output of `json_encode()` as a JavaScript literal,
 the code above URL-encodes the whole serialization (braces, quotation
 marks and all), and we URL-decode and parse the resulting string in
 JavaScript to get back the structured data.
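
 If this pattern were wrapped in a helper, it might look roughly like the
 sketch below. The function name is hypothetical (it is not a WordPress
 API), and plain `json_encode()` stands in for `wp_json_encode()` so the
 example runs without WordPress.

```php
<?php
// Hypothetical helper, not part of WordPress: returns a JavaScript
// expression that rebuilds $value in the browser.
function sketch_js_json_expression( $value ) {
	// URL-encode the whole serialization so that no HTML-special
	// characters can appear inside the <script> element.
	$encoded = rawurlencode( json_encode( $value ) );

	return "JSON.parse( decodeURIComponent( '" . $encoded . "' ) )";
}

$expr = sketch_js_json_expression( array( 'html' => '</script>' ) );

echo "<script>\nvar foo = ", $expr, ";\n</script>\n";
```

 The single quotes around the encoded string are safe because
 `rawurlencode()` always encodes `'` as `%27`.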

 This URL-encoding is a bit tedious and results in some ugly looking
 JavaScript. (Ugly JavaScript is better than vulnerable JavaScript!) We can
 instead use JavaScript’s Unicode-escaping:

 HTML Special Characters:
 {{{
 <script>
 "<" === "\u003c" // true: < is U+3C
 ">" === "\u003e" // true: > is U+3E
 "&" === "\u0026" // true: & is U+26
 "'" === "\u0027" // true: ' is U+27
 '"' === "\u0022" // true: " is U+22
 </script>
 }}}

 PHP’s `json_encode()` has an `$options` parameter, which can be used to
 always Unicode-escape these HTML special characters:

 {{{
 ALMOST (PHP 5.3+):
 <script>
 var foo = <?php echo wp_json_encode( $foo, JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT ); ?>;
 </script>
 }}}

 These constants are only available as of PHP 5.3.

 Also, just replacing those characters isn’t good enough. We also need to
 Unicode-escape the backtick character and `$` because of their special
 meanings in JavaScript template literals.

 {{{
 GOOD (PHP 5.3+):
 <script>
 var message = `hello, ${<?php echo str_replace(
         array( '`', '$' ),
         array( '\\u0060', '\\u0024' ),
         wp_json_encode( $user, JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT )
 ); ?>.name}`;
 </script>
 }}}
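
 To see what the flags and the extra replacement actually produce, here is
 a sketch in plain PHP. The function name is hypothetical (not a WordPress
 API) and `json_encode()` stands in for `wp_json_encode()`. The blanket
 `str_replace()` relies on the fact that, in JSON output, a backtick or
 `$` can only ever appear inside a string.

```php
<?php
// Hypothetical helper, not part of WordPress.
function sketch_json_for_script( $value ) {
	$flags = JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT;

	// Backtick and "$" are not covered by any JSON_HEX_* flag, so they are
	// replaced by their Unicode escapes after encoding.
	return str_replace(
		array( '`', '$' ),
		array( '\\u0060', '\\u0024' ),
		json_encode( $value, $flags )
	);
}

$html_chars     = sketch_json_for_script( "</script> & 'x'" );
$template_chars = sketch_json_for_script( 'cost: $5 `now`' );

echo $html_chars, "\n";     // "\u003C\/script\u003E \u0026 \u0027x\u0027"
echo $template_chars, "\n"; // "cost: \u00245 \u0060now\u0060"
```

 Both results are still valid JSON and decode back to the original strings.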

--

Comment (by whyisjake):

 Ideally, WordPress core would have a few functions that replace these
 laborious methods of escaping JavaScript content for the different
 contexts. Something like the following:

 * `esc_json()`
 * `esc_js_attr()` or maybe `esc_attr( $thing, 'json' )`
 * `esc_wp_json_encode()` etc...

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/51159#comment:1>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

