[wp-trac] [WordPress Trac] #19998: Feeds can contain characters that are not valid XML

WordPress Trac noreply at wordpress.org
Sun Feb 22 11:50:23 UTC 2015


#19998: Feeds can contain characters that are not valid XML
--------------------------+------------------------------
 Reporter:  westi         |       Owner:
     Type:  defect (bug)  |      Status:  new
 Priority:  normal        |   Milestone:  Awaiting Review
Component:  Feeds         |     Version:  3.3.1
 Severity:  normal        |  Resolution:
 Keywords:  has-patch     |     Focuses:
--------------------------+------------------------------

Comment (by mdgl):

 See #3670 for some initial ideas about a broader `esc_xml()` abstraction
 that would allow us to more easily clean up problems such as this one.

 Many characters are invalid within XML (see http://www.w3.org/TR/REC-xml/)
 and these need to be removed whether they occur directly or as part of
 (numeric) character references (e.g. an explicit `&#x0b;` in the text).
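
 As a rough illustration of the rule above, here is a minimal sketch in
 Python (not WordPress code). It assumes the text is already decoded to
 Unicode and uses the `Char` production from the XML 1.0 specification.
 The names `is_xml_char()` and `strip_invalid_xml()` are hypothetical, and
 stripping (rather than escaping) is just one possible policy.

```python
import re


def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 `Char` production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Numeric character references: decimal (&#11;) or hexadecimal (&#x0b;).
# XML only permits a lowercase "x" in the hexadecimal form.
_NUMERIC_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def _drop_invalid_ref(match: "re.Match[str]") -> str:
    hex_digits, dec_digits = match.group(1), match.group(2)
    cp = int(hex_digits, 16) if hex_digits else int(dec_digits, 10)
    # Keep references to valid characters, remove the rest.
    return match.group(0) if is_xml_char(cp) else ""


def strip_invalid_xml(text: str) -> str:
    """Remove invalid characters, whether literal or numerically referenced."""
    # First drop characters that occur directly in the text...
    text = "".join(ch for ch in text if is_xml_char(ord(ch)))
    # ...then drop numeric references that point at invalid code points.
    return _NUMERIC_REF.sub(_drop_invalid_ref, text)
```

 With this sketch, a literal vertical tab, `&#x0b;` and `&#11;` are all
 removed, while a valid reference such as `&#x41;` is left untouched.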

 Note that function `wp_kses_normalize_entities()` already deals with
 invalid (numeric) character references by escaping them where necessary.
 Unfortunately, this function also allows a hard-wired list of HTML named
 entities, many of which are not valid in XML. Some simple re-factoring
 could allow this function to support both HTML and XML.
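
 The kind of re-factoring meant here might look roughly like the following
 Python sketch (again, not WordPress code): the list of permitted named
 entities becomes a parameter instead of being hard-wired. XML itself
 predefines only `amp`, `lt`, `gt`, `quot` and `apos`. The function name
 `normalize_named_entities()` is hypothetical, and Python's
 `html.entities.name2codepoint` table is only a stand-in for the HTML list.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by the XML specification.
XML_ENTITIES = frozenset({"amp", "lt", "gt", "quot", "apos"})
# Stand-in for an HTML entity list (an assumption for illustration only).
HTML_ENTITIES = frozenset(name2codepoint)

_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def normalize_named_entities(text: str, allowed: frozenset) -> str:
    """Escape the ampersand of any named entity that is not in `allowed`."""

    def fix(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name in allowed:
            return match.group(0)  # valid in the target format, keep as-is
        return "&amp;" + name + ";"  # neutralise by escaping the ampersand

    return _NAMED_REF.sub(fix, text)
```

 Called with `XML_ENTITIES`, the text `&nbsp;` becomes `&amp;nbsp;`; called
 with `HTML_ENTITIES`, it is left alone.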

 The situation is slightly complicated for XML fields that will
 subsequently be processed as HTML, as occurs in RSS and Atom. Here, if we
 encode as CDATA we could leave such invalid named entities alone for later
 processing.
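
 For the CDATA case, a minimal Python sketch (the helper name `cdata()` is
 hypothetical) could wrap the HTML payload and split any embedded `]]>` so
 the section cannot be terminated early. Note that characters that are
 invalid in XML are still not allowed inside CDATA, so they would need to
 be removed separately.

```python
def cdata(text: str) -> str:
    """Wrap text in a CDATA section, splitting any embedded "]]>"."""
    # "]]>" would end the section, so close it after "]]" and reopen
    # a new section before ">".
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"
```

 Named entities such as `&nbsp;` pass through unchanged and are left for
 the HTML processor that later reads the field.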

 I also notice that we recently added some clean-up of directly occurring
 control characters to `wp_kses_no_null()` as part of #28506. However, this
 strips rather than escapes the characters, and it does not deal with many
 of the other UTF-8 characters that are not valid in XML. That might be
 slightly harder to address, given that there appears to be some confusion
 over exactly which character encoding is being used in all situations (is
 using `blog_charset` sufficient?) as well as over the availability of PHP
 support for UTF-8.

--
Ticket URL: <https://core.trac.wordpress.org/ticket/19998#comment:7>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

