[wp-trac] [WordPress Trac] #18311: Caption shortcode doesn't support markup in the caption.

WordPress Trac wp-trac at lists.automattic.com
Fri Oct 21 20:59:16 UTC 2011


#18311: Caption shortcode doesn't support markup in the caption.
-------------------------+-----------------------------
 Reporter:  prettyboymp  |       Owner:
     Type:  enhancement  |      Status:  new
 Priority:  normal       |   Milestone:  Future Release
Component:  TinyMCE      |     Version:
 Severity:  normal       |  Resolution:
 Keywords:  needs-patch  |
-------------------------+-----------------------------
Changes (by jeremyclarke):

 * cc: jer@… (added)


Comment:

 The first step of the problem is in {{{image_add_caption()}}}, which is
 attached to the {{{image_send_to_editor}}} filter and handles actually
 inserting the caption shortcode into post content. Notably, the
 {{{image_add_caption()}}} function is completely undocumented ( {@internal
 Missing Short Description}}) so it's hard to know what was intended by
 it's behavior, though clearly the idea was to simply remove anything that
 might be offensive inside a shortcode attribute without consideration for
 formatting that might be important to preserve.

 Interestingly the upload manager popup correctly saves whatever HTML you
 insert in the caption without stripping it or showing an error. The
 problem is that when the shortcode is inserted into the post,
 {{{image_add_caption()}}} runs the following string replacement, which
 completely neutralises any HTML in the caption by replacing it with HTML
 entities:

 {{{
 $caption = str_replace( array( '>',    '<',    '"',      "'" ),
         array( '&gt;', '&lt;', '&quot;', '&#039;' ),
                 $caption
           );
 }}}

 This is despite the fact that re-opening the image in the gallery shows
 you that the HTML is still saved correctly and unmolested in the caption
 field, which is very confusing. At a minimum the media uploader should not
 save any HTML in the caption field that will not ultimately function in
 the caption shortcode itself.

 If the str_replace is removed then inserted HTML continues to not function
 because when the actual $caption content is output, addslashes() is run on
 it first( {{{addslashes($caption)}}}) which creates useless broken HTML.

 The issue is further complicated by the fact that the caption text is
 stored in the {{{caption=""}}} attribute of the shortcode, so any double-
 quotes within the caption text truly don't make any sense and absolutely
 SHOULD be either escaped/slashed or replaced with single-quotes for
 storage inside the shortcode attribute.

 '''THE CAPTION SHORTCODE IS FUNDAMENTALLY FLAWED, NUTS'''

 Looking at it closely it seems like the architecture of the caption
 shortcode is fundamentally flawed because it uses an 'attribute' of the
 shortcode (caption="") to store the caption text, and attributes are
 rightly blocked from having the kind of formatting a useful caption needs.
 Since the shortcode has only one piece of content that can go between the
 two tags (i.e. between [caption] and [/caption]) that content should have
 been the caption text itself (which could then take advantage of being
 regular post content), rather than the image html tag that is currently
 there (which could have been broken down into esc_attr()-safe attributes
 stored inside the initial [caption] shortcode and parsed into an img tag
 during shortcode execution).

 Unfortunately re-architecting the caption shortcode now is probably off
 the table, and I doubt we can get traction for fully replacing it with a
 new more powerful shortcode in core. That is the problem we face though,
 the current architecture is just wrong for captions that are any more than
 another kind of alt label.


 '''WE NEED SOME WAY TO ESCAPE HTML IN THE CAPTION TEXT ON SAVE, THEN
 UNESCAPE ON DISPLAY'''

 So the trick we need to pull off is getting the system to somehow allow
 for at least a limited set of HTML (personally I'd settle for just links)
 inside the 'caption' attribute of the [caption] shortcode, then making the
 shortcode function {{{img_caption_shortcode()}}} display it correctly. If
 that worked, then it should be possible to get the visual editor to stop
 messing it up when you switch back and forth.

 To me it seems like the best approach within the current shortcode would
 be to carefully account for particular HTML tags inside the 'caption'
 attribute by escaping them in a particular way while inserting them into
 the post, then when executing the shortcode carefully accounting for the
 escaped HTML and reverting it into an active state.

 '''USING ADDSLASHES()/STRIPSLASHES() TO ESCAPE CAPTION TEXT: FAIL'''

 The simplest version of that idea is to just remove the anti-html
 str_replace (quoted above) from {{{image_add_caption()}}} and instead
 depend on addslashes() to keep the content safe, then, in
 {{img_caption_shortcode()}} using stripslashes() to get the content back
 to it's original state.

 Unfortunately we can't just stripslashes() in the current state of the
 codebase because a slash-escaped link tag inside the caption attribute
 causes the whole shortcode's $attr array to lose it's lunch and become
 broken, here's and example where the caption was "Testing a link" and the
 link was an a tag leading to test.com which used double-quotes around the
 URL:

 {{{
  [id] => attachment_74823
  [align] => alignnone
  [width] => 375
  [0] => caption="Testing
  [1] => a
  [2] =>  href="http://test.com">link"
 }}}

 That's the $attr array recieved by the shortcode function
 img_caption_shortcode(), so before we even have a chance to
 stripslashes($caption) the shortcode fails completely because the $caption
 attribute is mandatory for the shortcode and not present in $atts.

 The culprit here is {{{shortcode_parse_atts()}}} which uses some intense
 regex voodoo I can't even begin to understand but is clearly not ready for
 a world where slash-escaped double-quotes are part of a shortcode
 attribute.

 However, in this case switching to single-quotes actually fixes the
 problem, and results in the link showing correctly on the image. Maybe a
 final solution will involve enforcing single-quotes -only within caption
 text HTML, though that seems awkward, since it will be hard to distinguish
 between a double-quote used as part of a link and one used as part of
 regular text, and we don't want to replace text-quotes with a single-quote
 as that would be confusing for authors.

 '''USING URLENCODE()/URLDECODE() TO ESCAPE CAPTION TEXT: UGLY'''

 An alternate idea would be to use another escaping system entirely, to
 completely neutralize the caption text before sending it to the editor,
 then re-activate it when parsing the shortcode on display. I tried
 replacing addslashes/stripshashes with urlencode/urldecode and the results
 were very promising. In {{{image_add_caption()}}} I use
 urlencode($caption) before sending it to the editor, which sends things
 like this to the post:

 {{{
 caption="TEsting+a+%3Ca+href%3D%22http%3A%2F%2Ftest.com%22%3Elink%3C%2Fa%3E+with+also+some+%3Cstrong%3Ebold%3C%2Fstrong%3E+text.+"
 }}}

 Then in {{{img_caption_shortcode()}}} I run $caption =
 urldecode($caption), which reverts it back to it's original state and
 shows perfectly fine on the post preview.

 This even avoids having the visual editor break the content when switching
 back and forth, but unfortunately it also looks AWFUL in the visual
 editor, because the urlencode() version of the caption is shown, which
 would probably confuse even users savvy enough to be adding HTML links in
 the caption field in the first place. This same awfulness applies to
 trying to read or edit the caption inside the HTML editor, which is
 infinitely more difficult then when the caption text is shown as plain
 text. Yet another place where the ugly urlencoded version shows is in the
 visual editor when you edit image settings.

 '''WHAT OTHER FORMAT MIGHT WORK?'''

 It seems like a big problem here is that any attempt to escape the HTML
 will need us to account for every single place where it is shown to users
 and unescape it for display in that space. Otherwise the escaping pattern
 would have to be simple/obvious enough that looking at the escaped version
 still made it clear what the HTML was for.

 Maybe the solution is some kind of non-HTML code similar to shortcodes
 that could be used inside the caption attribute to identify links etc.
 This is not ideal as it creates yet another layer of syntax to learn and I
 would avoid that if at all possible, but it may be the only way to get
 around the problematic architecture that the caption shortcode currently
 suffers from. Maybe something that emulates mediawiki markup like this
 could work?

 {{{
 [http://test.com/ test]
 }}}

 '''OOF! THERE IS A REASON THIS HAS GONE SO LONG WITHOUT A FIX'''

 Okay enough from me for today. Just wanted to dig into the code and do
 some research about the problem. I think that this probably needs more
 than just a patch, it needs thinking and decisions because so far there's
 still no solution that seems appropriate despite the obviousness of the
 need.

-- 
Ticket URL: <http://core.trac.wordpress.org/ticket/18311#comment:9>
WordPress Trac <http://core.trac.wordpress.org/>
WordPress blogging software


More information about the wp-trac mailing list