[wp-trac] [WordPress Trac] #19998: Feeds can contain characters that are not valid XML

WordPress Trac noreply at wordpress.org
Fri Jun 19 00:25:59 UTC 2020


#19998: Feeds can contain characters that are not valid XML
----------------------------------------+------------------------------
 Reporter:  westi                       |       Owner:  (none)
     Type:  defect (bug)                |      Status:  new
 Priority:  normal                      |   Milestone:  Awaiting Review
Component:  Feeds                       |     Version:  3.3.1
 Severity:  normal                      |  Resolution:
 Keywords:  has-patch needs-unit-tests  |     Focuses:
----------------------------------------+------------------------------

Comment (by pbiron):

 As mentioned in the commit message above, core now has an `esc_xml()`
 function (and accompanying `esc_xml` filter).

 However, that function does **not** (yet) strip characters that are not
 legal in XML.  As stated in the
 [https://github.com/GoogleChromeLabs/wp-sitemaps/pull/192 PR] that added
 `esc_xml()` to the sitemaps feature plugin:

 > it is an open question what to do when the text passed to these
 functions/filter contain characters which are not allowed in XML.

 It would be easy to add functionality like
 [https://core.trac.wordpress.org/raw-attachment/ticket/19998/19998b.patch
 19998b.patch] to `esc_xml()` (once the regex is adjusted to include code
 points beyond the Basic Multilingual Plane, see below), if that is the
 consensus of what people want.

 Stripping illegal characters ''may be'' appropriate for feeds (and
 sitemaps and oEmbeds), but **may not** be appropriate for exports (i.e.,
 WXR).  There is a small but nonzero probability that those illegal
 characters have meaning to the plugin that wrote them to the DB (e.g.,
 in post meta), so stripping them ''might'' cause a problem when the
 stripped content is imported into another site.

 So, maybe `esc_xml()` could take a 2nd parameter indicating what to do
 when there are illegal characters, accepting values like `strip`,
 `empty_string` and `wp_error`?  The default would be `strip`;
 `empty_string` would indicate that an empty string should be returned if
 there are **any** illegal chars; `wp_error` would indicate that a
 `WP_Error` should be returned, so that, for instance, the exporter could
 skip exporting something that contained illegal chars.
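
 A rough sketch of how such a 2nd parameter might behave, written as a
 standalone helper rather than as a change to `esc_xml()` itself.  The
 function name `sketch_handle_illegal_xml_chars()`, the error code, and the
 handling of invalid UTF-8 are all assumptions made for illustration; the
 `WP_Error` stand-in only exists so the sketch runs outside WordPress.

```php
<?php
// Stand-in so the sketch runs outside WordPress (assumption: inside
// WordPress the real WP_Error class is already loaded and is used instead).
if ( ! class_exists( 'WP_Error' ) ) {
	class WP_Error {
		public $code;
		public $message;
		public function __construct( $code = '', $message = '' ) {
			$this->code    = $code;
			$this->message = $message;
		}
		public function get_error_code() {
			return $this->code;
		}
	}
}

// Hypothetical helper, NOT part of core. $on_illegal takes the values
// proposed above: 'strip' (default), 'empty_string' or 'wp_error'.
function sketch_handle_illegal_xml_chars( $text, $on_illegal = 'strip' ) {
	// Matches any character outside the XML 1.0 Char production.
	$illegal = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

	// 1 = illegal char found, 0 = none, false = failure (e.g. invalid UTF-8).
	$found = preg_match( $illegal, $text );
	if ( 0 === $found ) {
		return $text;
	}

	if ( 'wp_error' === $on_illegal ) {
		return new WP_Error( 'illegal_xml_chars', 'Text contains characters that are not legal in XML.' );
	}

	// Assumption: text that cannot be scanned at all is treated like the
	// 'empty_string' case, since there is nothing reliable to strip.
	if ( 'empty_string' === $on_illegal || false === $found ) {
		return '';
	}

	return preg_replace( $illegal, '', $text );
}
```

 With something like this, an exporter could call the helper with
 `wp_error` and skip any value for which a `WP_Error` comes back, while
 feeds could keep the default `strip` behavior.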

 Instead of a 2nd parameter, another possibility would be to have the
 `strip`, `empty_string`, or `wp_error` behavior handled by hooking into
 the `esc_xml` filter with various callbacks.
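
 As an illustration of the filter-based variant, the `strip` behavior could
 be a callback along these lines.  The callback name is hypothetical, and
 the sketch assumes the `esc_xml` filter passes the escaped text as its
 first argument; the `function_exists()` guard is only there so the snippet
 also runs outside WordPress.

```php
<?php
// Hypothetical callback implementing the 'strip' behavior.
function sketch_strip_illegal_xml_chars( $safe_text ) {
	$stripped = preg_replace(
		'/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
		'',
		$safe_text
	);

	// preg_replace() returns null on failure (e.g. invalid UTF-8).
	// Assumption: fall back to an empty string in that case.
	return null === $stripped ? '' : $stripped;
}

// Register the callback only when running inside WordPress.
if ( function_exists( 'add_filter' ) ) {
	add_filter( 'esc_xml', 'sketch_strip_illegal_xml_chars' );
}
```

 A site that preferred the `empty_string` behavior would hook a different
 callback instead of this one.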

 p.s. the regex to strip illegal chars should actually be as follows (and
 the replacement should be the empty string, and not a space character):

 {{{#!php
 // Remove every character that is outside the XML 1.0 Char production.
 preg_replace(
         '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
         '',
         $string
 );
 }}}
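
 To show why the range beyond the Basic Multilingual Plane matters, here is
 a small comparison.  The `$bmp_only` pattern is an assumed example of a
 character class that stops at U+FFFD; it is written for illustration and
 is not copied from any patch on this ticket.

```php
<?php
// Assumed BMP-only class (illustration only) versus the full class above.
$bmp_only = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}]/u';
$full     = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

// A vertical tab (U+000B, illegal in XML 1.0) and an emoji (U+1F600, legal).
$input = "Hi\x0B \u{1F600}";

// The BMP-only class removes the emoji along with the vertical tab.
$with_bmp_only = preg_replace( $bmp_only, '', $input );

// The full class removes only the vertical tab and keeps the emoji.
$with_full = preg_replace( $full, '', $input );
```

 The emoji is a legal XML character, so only the second pattern leaves the
 text intact apart from the genuinely illegal vertical tab.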

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/19998#comment:17>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

