[wp-trac] [WordPress Trac] #3670: Removing CDATA close tag ( ]]> ) unbalances the CDATA block

Tue Feb 10 11:39:08 UTC 2015

#3670: Removing CDATA close tag ( ]]> ) unbalances the CDATA block
----------------------------------------+-----------------------------
 Reporter:  scenic                      |       Owner:  andy
     Type:  defect (bug)                |      Status:  new
 Priority:  normal                      |   Milestone:  Future Release
Component:  Posts, Post Types           |     Version:  2.1
 Severity:  minor                       |  Resolution:
 Keywords:  has-patch needs-unit-tests  |     Focuses:  template
----------------------------------------+-----------------------------
Changes (by mdgl):

 * keywords:  has-patch needs-refresh needs-unit-tests => has-patch needs-
     unit-tests

Comment:

 This is a long-standing problem that appears to have caused much
 frustration.  Let's take another look.

 '''Problem Statement'''

 When generating the XML output for feeds (RDF, RSS, RSS2, Atom), certain
 data fields are wrapped within CDATA blocks to make sure that they remain
 valid XML. This may be unnecessary in some cases but works fine unless the
 content of the data field happens to include the CDATA block terminator
 string ']]>'. In this situation, we need to "escape" that data to avoid an
 XML parsing error. WordPress presently does this in a "belt and braces"
 way by automatically substituting the string ']]>' by ']]&gt;' within the
 main post content data field. There are a number of problems with this
 approach:

 * the substitution is always applied, regardless of whether the output is
 being produced for an XML CDATA field. This effectively mangles the normal
 output of post content in many situations.
 * the substitution is applied only to the main post content data field
 (i.e. that returned by "the_content()"). Other data fields that are
 wrapped within CDATA blocks are not similarly protected.

 The most common report is that this behaviour makes it difficult to
 include JavaScript code within post content whilst retaining XML/XHTML
 compliance. For example, the TinyMCE editor automatically wraps code
 within `<script>` tags in a CDATA block when you switch between the "Text"
 and "Visual" views, which will then be mangled on output. Although modern
 browsers generally seem to cope with the missing CDATA terminator (parsing
 as HTML), other XML processors and validators might not be so forgiving.
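
 The mangling described above can be reproduced in isolation. The following
 is a minimal sketch, not WordPress code: the sample content is invented for
 illustration, and the single `str_replace()` call stands in for the
 substitution currently applied to post content.

```php
<?php
// Hypothetical post content: a script wrapped in a CDATA block,
// similar to what the editor produces for XHTML compliance.
$content = "<script>\n//<![CDATA[\nalert( 'hi' );\n//]]>\n</script>";

// Stand-in for the unconditional substitution applied to post content.
$mangled = str_replace( ']]>', ']]&gt;', $content );

echo $mangled, "\n";

// The CDATA opener survives, but the terminator no longer appears,
// so an XML parser sees an unterminated CDATA section.
$has_opener     = strpos( $mangled, '<![CDATA[' ) !== false;
$has_terminator = strpos( $mangled, ']]>' ) !== false;
var_dump( $has_opener, $has_terminator );
```

 The opener is left in place while the terminator is rewritten, which is
 why strict XML processors reject the resulting page.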

 With the rise of HTML5, interest in XHTML is doubtless waning, but
 modifying post content in this way is pretty egregious behaviour and
 probably still worth fixing.

 '''Solution Overview'''

 According to Wikipedia (http://en.wikipedia.org/wiki/CDATA), the preferred
 way to encode CDATA sections that include the string ']]>' is to create
 multiple concatenated CDATA blocks by substituting the string by
 ']]]]><![CDATA[>'. That is, we forcibly end the first CDATA block after
 the ']]' and then include the missing '>' at the beginning of the next
 CDATA block.

 This substitution should only occur if we are preparing the output for
 inclusion in an XML CDATA block. The substitution should, however, be
 applied to whatever data fields we are processing.

 Interestingly, the current WordPress export tool (which generates XML/RSS2
 output) appears to implement this correctly, applying the function
 `wxr_cdata()` to (almost) all data fields, as follows:

 {{{
 function wxr_cdata( $str ) {
    if ( seems_utf8( $str ) == false )
       $str = utf8_encode( $str );
    $str = '<![CDATA[' . str_replace( ']]>', ']]]]><![CDATA[>', $str ) . ']]>';
    return $str;
 }
 }}}
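
 As a quick check that this split encoding survives a round trip through an
 XML parser, here is a small sketch. The helper `example_cdata()` and the
 sample string are hypothetical; the helper copies only the substitution
 from `wxr_cdata()` and leaves out the UTF-8 handling. It assumes PHP's DOM
 extension is available.

```php
<?php
// Hypothetical helper: same substitution as wxr_cdata(), minus the
// WordPress-specific UTF-8 check.
function example_cdata( $str ) {
    return '<![CDATA[' . str_replace( ']]>', ']]]]><![CDATA[>', $str ) . ']]>';
}

// Sample data containing the CDATA terminator sequence.
$original = 'if ( a[b[0]]>1 ) { go(); }';
$xml      = '<description>' . example_cdata( $original ) . '</description>';

$doc    = new DOMDocument();
$loaded = $doc->loadXML( $xml );

// textContent concatenates the two adjacent CDATA sections.
$decoded = $doc->documentElement->textContent;
var_dump( $loaded, $decoded === $original );
```

 The parser reads two adjacent CDATA sections and a consumer that asks for
 the element's text gets the original string back unchanged.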

 The updated patch applies the same approach when generating the XML output
 for feeds. Filters could be applied to the functions used to generate
 feed output, but this elegant solution might introduce backwards-
 compatibility issues if developers/users have made use of these functions
 themselves. Instead, the feed generation code has been modified to encode
 the values explicitly.
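
 To illustrate what "encoding the values explicitly" means, here is a rough
 sketch. It is not the patch itself: `example_feed_cdata()` and
 `example_rss_item()` are hypothetical names, and a real feed template
 would write its elements out individually rather than loop over an array.

```php
<?php
// Hypothetical encoder using the same substitution as wxr_cdata().
function example_feed_cdata( $str ) {
    return '<![CDATA[' . str_replace( ']]>', ']]]]><![CDATA[>', $str ) . ']]>';
}

// Hypothetical feed-item builder: each value is encoded at the point where
// it is written out, so the functions that return the values stay untouched.
function example_rss_item( array $fields ) {
    $xml = "<item>\n";
    foreach ( $fields as $tag => $value ) {
        $xml .= "\t<$tag>" . example_feed_cdata( $value ) . "</$tag>\n";
    }
    return $xml . "</item>\n";
}

echo example_rss_item( array(
    'title'       => 'Arrays and ]]> in titles',
    'description' => 'Excerpt with x[y[0]]>0 inside.',
) );
```

 Because the encoding happens only in the output code, callers that use the
 underlying functions outside of feeds continue to receive unmodified data.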

 The new encoding function has been applied to just those locations dealing
 with the main post content data field. This includes excerpts since these
 may be automatically derived from the post content and so a new function
 `get_the_excerpt_rss()` has been added to support the retrieval rather
 than display of excerpts.

 Future work could apply the same XML encoding technique to other data
 fields (see below).

 Unlike the earlier, partial patch, the updated patch does not modify the
 deprecated function `the_content_rss()`, again for reasons of
 backwards compatibility.

 '''Feed Reader Compatibility'''

 The encoding in the updated patch has been tested and appears to work
 correctly with Internet Explorer 11, Opera 27 and Firefox 35 as well as
 with XMLSpy.

 '''Future Work'''

 The updated patch creates a new function `_esc_xml_cdata()` which, as the
 name suggests, is intended to form part of a future and more powerful
 `esc_xml()` abstraction supported by other functions for dealing with XML
 encoding. In general our handling of XML encoding is very inconsistent and
 could benefit from a more wholesale clean-up. XML encoding has a number of
 subtleties and is actually quite a tricky problem to address.

--
Ticket URL: <https://core.trac.wordpress.org/ticket/3670#comment:87>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform