[wp-hackers] WP issues

Sat Jun 2 13:46:09 GMT 2007

On 6/1/07, Geoffrey Sneddon <foolistbar at googlemail.com> wrote:
>
> 1. People have been asking for an XML serialiser to be used for all
> the XML WordPress produces for years. This doesn't exist. This allows
> invalid bytes to get into XML data. Try parsing <http://photomatt.net/
> comments/feed/atom/> with a compliant XML parser. You'll get a fatal
> error. This is the exact sort of issue that an XML serialiser would
> avoid.
>
>         c) "We have no way currently to ensure XHTML validity."[TICKET1526]
> — See 1.

The term "serialiser" is vague (what are you serialising from?), but I
assume you meant that the output should be built as, say, a DOM
object, then serialised from it to a text/xml or application/xml
document. If so, then I disagree. It's not a magic bullet.

Most errors occur when users save posts and comments full of malformed
markup and bad character data. Building output as an XML DOM won't
help with that at all, because the broken input comes in as a string
and will need to be corrected beforehand. If that problem can be
solved, the class of errors that a serialiser would catch is
comparatively easy to handle.
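
To make that concrete, here is a minimal sketch in Python (my choice
for illustration; this is not WordPress code, and the sample comment
and the parses_as_xml helper are made up). A tree API only guarantees
well-formed output by treating the user's text as opaque and escaping
it; to keep the markup as markup, the string has to be parsed first,
and that is the step that fails.

```python
import xml.etree.ElementTree as ET

# Hypothetical comment body: an unclosed <b> and a bare ampersand.
user_comment = "Fish & chips are <b>great"

# Going through a tree API yields well-formed XML, but only because the
# comment is treated as plain text and every special character is escaped.
entry = ET.Element("content")
entry.text = user_comment
serialised = ET.tostring(entry, encoding="unicode")


def parses_as_xml(fragment):
    """Return True if the fragment is well-formed inside a wrapper element."""
    try:
        ET.fromstring("<content>" + fragment + "</content>")
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(serialised)                    # escaped text, markup is gone
    print(parses_as_xml(user_comment))   # the raw string cannot become a tree
```

The serialiser never sees the broken markup as markup, so whatever
cleans up the string still has to run before it.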

WordPress works the way it does now because:

 a) It's an impossibly bad user experience to show a yellow screen of
death to an end user.

 b) It's almost as bad to expect an end user to manually correct
well-formedness errors. (Does Aunt Mildred even know what an entity
is?)

 c) It's hard to automatically, reliably correct broken HTML. The
rough consensus last time this came up was that it was *too* hard.

 d) If it's not too hard, well, the architecture is such that broken
HTML can be corrected in a plugin as easily as it can be in core. If
you build it, they will come.

Until that time, the fact is that the burden of dealing with bad
markup is pushed onto those users best able to deal with it. The
browsers already can; the major aggregators do (with varying degrees
of success); there are excellent error-correcting feed parsing
libraries for Python and Java.

Yes, as a feed consumer, it sucks. These might not seem like good
reasons, and it's perfectly fair to resent that the problem is pushed
onto you. (In my case, the path of least resistance was switching to
Python, where others had already dealt with it. ;)

In any event, it seems to me that many of the specific problems you
cite are symptomatic of the lack of automated testing. One check with
a validator would have caught the @content bug; unit tests would have
exposed the tag-balancing issue and prevented regressions. Tests with
an XML parser could catch the same errors as a serialiser. I'd much
rather see a comprehensive test suite than a bunch of hideous DOM
manipulation code.
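
As a rough sketch of what I mean (Python again, purely illustrative):
render_entry below is a hypothetical stand-in for template-style feed
output, not anything in WordPress, and the input list is invented. The
test just runs a strict XML parser over the output for a handful of
nasty inputs.

```python
import unittest
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def render_entry(title, body):
    """Hypothetical string-template output, standing in for a feed template."""
    return (
        '<entry xmlns="http://www.w3.org/2005/Atom">'
        "<title>" + escape(title) + "</title>"
        '<content type="html">' + escape(body) + "</content>"
        "</entry>"
    )


class FeedWellFormednessTest(unittest.TestCase):
    # Invented samples of the kind of text users actually submit.
    NASTY_INPUTS = ["Fish & chips", "<b>unclosed", "a < b > c", "]]> stray"]

    def test_output_parses(self):
        for text in self.NASTY_INPUTS:
            with self.subTest(text=text):
                # fromstring raises ParseError on any well-formedness error.
                root = ET.fromstring(render_entry(text, text))
                # The title must round-trip unchanged through escaping.
                self.assertEqual(root[0].text, text)
```

Run it with "python -m unittest" against the file. Drop the escape()
calls from render_entry and the first input already fails, which is
the same error a serialiser would have prevented.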

But my TDD fanboyism is another issue for another day. I just thought
you deserved some more replies. :)