[wp-trac] [WordPress Trac] #19998: Feeds can contain characters that are not valid XML
WordPress Trac
noreply at wordpress.org
Fri Jun 19 00:25:59 UTC 2020
#19998: Feeds can contain characters that are not valid XML
----------------------------------------+------------------------------
Reporter: westi | Owner: (none)
Type: defect (bug) | Status: new
Priority: normal | Milestone: Awaiting Review
Component: Feeds | Version: 3.3.1
Severity: normal | Resolution:
Keywords: has-patch needs-unit-tests | Focuses:
----------------------------------------+------------------------------
Comment (by pbiron):
As mentioned in the commit message above, core now has an `esc_xml()`
function (and accompanying `esc_xml` filter).
However, that function does **not** (yet) strip characters that are not
legal in XML. As stated in the
[https://github.com/GoogleChromeLabs/wp-sitemaps/pull/192 PR] that added
`esc_xml()` to the sitemaps feature plugin:
> it is an open question what to do when the text passed to these
functions/filter contain characters which are not allowed in XML.
It would be easy to add functionality like
[https://core.trac.wordpress.org/raw-attachment/ticket/19998/19998b.patch
19998b.patch] to `esc_xml()` (once the regex is adjusted to include code
points beyond the Basic Multilingual Plane, see below), if that is the
consensus of what people want.
Stripping illegal characters ''may be'' appropriate for feeds (and
sitemaps and oembeds), but **may not** be appropriate for exports (i.e.,
WXR) as there is a small (but not 0) probability that those illegal chars
actually have meaning in the context of the plugin that wrote them to the
DB (e.g., stored in post meta) and stripping them out ''might'' cause a
problem when the content with them stripped out is imported into another
site.
So, maybe `esc_xml()` could take a 2nd parameter indicating what to do
when there are illegal characters (accepting values like `strip`,
`empty_string` and `wp_error`)? The default would be `strip`; `empty_string`
would indicate that an empty string should be returned if there are
**any** illegal chars; `wp_error` would indicate that a `WP_Error` should
be returned, so that, for instance, the exporter could skip exporting
something that contained illegal chars.
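A minimal sketch of how those three modes might behave, written as a
standalone helper rather than as a change to `esc_xml()` itself. The
function name `example_xml_handle_illegal_chars()`, the parameter name
`$on_illegal` and the error code are hypothetical; nothing like this exists
in core. The regex is the one given at the end of this comment.
{{{#!php
/**
 * Hypothetical helper (not part of core): handle characters that are not
 * legal in XML 1.0 according to the requested mode.
 *
 * @param string $text       Text to check.
 * @param string $on_illegal One of 'strip', 'empty_string' or 'wp_error'.
 * @return string|WP_Error|false
 */
function example_xml_handle_illegal_chars( $text, $on_illegal = 'strip' ) {
	// Matches any single character outside the XML 1.0 Char production.
	$illegal = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

	// preg_match() returns 0 when nothing illegal is found.
	if ( 0 === preg_match( $illegal, $text ) ) {
		return $text;
	}

	switch ( $on_illegal ) {
		case 'empty_string':
			return '';
		case 'wp_error':
			// WP_Error only exists inside WordPress; returning false elsewhere
			// is an assumption of this sketch.
			return class_exists( 'WP_Error' )
				? new WP_Error( 'illegal_xml_chars', 'Text contains characters that are not legal in XML.' )
				: false;
		case 'strip':
		default:
			// With the /u modifier, preg_replace() returns null for malformed
			// UTF-8, hence the cast to string.
			return (string) preg_replace( $illegal, '', $text );
	}
}
}}}
With something along these lines, an exporter could call the helper with
`wp_error` and skip the item whenever `is_wp_error()` is true for the result.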
Instead of a 2nd parameter, another possibility would be to have the
`strip`, `empty_string`, or `wp_error` behavior handled by hooking into the
`esc_xml` filter with various callbacks.
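As a rough illustration of the filter-based approach, the following sketch
shows what an `empty_string` callback might look like. The callback name
`example_esc_xml_empty_on_illegal()` is hypothetical, and the sketch assumes
the filter passes the escaped text as its first argument. The
`function_exists()` guard is only there so the snippet also runs outside
WordPress.
{{{#!php
/**
 * Hypothetical `esc_xml` filter callback: return an empty string when the
 * text contains any character that is not legal in XML 1.0.
 *
 * @param string $safe_text Text after escaping (assumed first filter argument).
 * @return string
 */
function example_esc_xml_empty_on_illegal( $safe_text ) {
	$illegal = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

	// preg_match() returns 1 when at least one illegal character is present.
	// On malformed UTF-8 it returns false and the text is passed through unchanged.
	return 1 === preg_match( $illegal, $safe_text ) ? '' : $safe_text;
}

// Register the callback only when running inside WordPress.
if ( function_exists( 'add_filter' ) ) {
	add_filter( 'esc_xml', 'example_esc_xml_empty_on_illegal' );
}
}}}
A feed or an exporter could each register the callback matching the behavior
it wants, without any change to the signature of `esc_xml()`.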
p.s. the regex to strip illegal chars should actually be as follows (and
the replacement should be the empty string, and not a space character):
{{{#!php
preg_replace(
'/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
'',
$string
);
}}}
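To illustrate what that regex keeps and removes, here is a small standalone
sketch. The wrapper name `example_strip_illegal_xml_chars()` is hypothetical,
and the `\u{...}` string escape needs PHP 7.0 or later. Note that with the
`/u` modifier `preg_replace()` returns `null` when the input is not valid
UTF-8, so callers may want to check for that.
{{{#!php
/**
 * Hypothetical wrapper around the regex above.
 *
 * @param string $string Text to clean.
 * @return string|null Cleaned text, or null if $string is not valid UTF-8.
 */
function example_strip_illegal_xml_chars( $string ) {
	return preg_replace(
		'/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
		'',
		$string
	);
}

// U+0008 (backspace) is not legal in XML, so it is removed.
$a = example_strip_illegal_xml_chars( "back\x08space" );

// Tab (U+0009) and newline (U+000A) are legal, so they are kept.
$b = example_strip_illegal_xml_chars( "tab\tok\n" );

// U+1F600 lies beyond the Basic Multilingual Plane but is legal, so it is kept.
$c = example_strip_illegal_xml_chars( "smile \u{1F600}" );
}}}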
--
Ticket URL: <https://core.trac.wordpress.org/ticket/19998#comment:17>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform