[wp-hackers] WP issues

Sat Jun 2 14:17:14 GMT 2007

On 2 Jun 2007, at 14:46, Sam Angove wrote:

> On 6/1/07, Geoffrey Sneddon <foolistbar at googlemail.com> wrote:
>>
>> 1. People have been asking for an XML serialiser to be used for all
>> the XML WordPress produces for years. This doesn't exist. This allows
>> invalid bytes to get into XML data. Try parsing
>> <http://photomatt.net/comments/feed/atom/> with a compliant XML
>> parser. You'll get a fatal error. This is the exact sort of issue
>> that an XML serialiser would avoid.
>>
>>         c) "We have no way currently to ensure XHTML
>> validity."[TICKET1526] See 1.
>
> The term "serialiser" is vague (what are you serialising from?), but I
> assume you meant that the output should be built as, say, a DOM
> object, then serialised from it to a text|application/xml document. If
> so, then I disagree. It's not a magic bullet.

Serialising from any XML structure, whether that is SAX events, a DOM
tree, or something else.

> Most errors occur when users save posts and comments full of malformed
> markup and bad character data. Building output as an XML DOM won't
> help with that at all, because the broken input comes in as a string
> and will need to be corrected beforehand. If that problem can be
> solved, the class of errors that a serialiser would catch are
> comparatively easy to handle.

The serialiser will ensure that the output is well-formed, so it
would strip invalid characters.
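
A minimal sketch of that idea, in Python rather than PHP since Python
is the language mentioned elsewhere in this thread. The function names
(strip_invalid_xml_chars, serialise_comment) and the element names are
hypothetical, not anything in WordPress. Note that Python's
ElementTree escapes markup characters on output but does not itself
remove characters that XML 1.0 forbids, so that filter is written out
explicitly here:

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Remove characters that may not appear in an XML 1.0 document."""
    return _INVALID_XML_CHARS.sub("", text)


def serialise_comment(author, body):
    """Build the entry as a tree, then serialise it (hypothetical layout).

    User input is only ever stored as text nodes, so stray '<' and '&'
    are escaped by the serialiser instead of being written as markup.
    """
    entry = ET.Element("entry")
    ET.SubElement(entry, "author").text = strip_invalid_xml_chars(author)
    ET.SubElement(entry, "content").text = strip_invalid_xml_chars(body)
    return ET.tostring(entry, encoding="unicode")
```

Called as serialise_comment("Sam", "a\x0b & <b>"), this drops the
vertical tab and escapes the ampersand and angle brackets, so the
result can be read back by a strict XML parser.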

> WordPress works the way it does now because:
>
> a) It's an impossibly bad user-experience to show a yellow screen of
> death to an end user.

Any UA that uses an XML parser already shows an XML fatal error.

> b) It's almost as bad to expect an end-user to manually correct
> well-formedness errors. (Does Aunt Mildred even know what an entity
> is?)

Well, why not use HTML? That gets around both this and the above.

>
> c) It's hard to automatically, reliably correct broken HTML. The
> rough consensus last time this came up was that it was *too* hard.

We already create broken HTML though. What's the difference?

> Until that time, the fact is that the burden of dealing with bad
> markup is pushed onto those users best able to deal with it. The
> browsers already can;

IE7 refuses to display a feed with any XML error (including invalid
characters). Firefox 2.0 and Safari 2 both display feeds with XML
errors.

> the major aggregators do (with varying degrees
> of success);

As above, not all of the major aggregators do.

> there are excellent error-correcting feed parsing
> libraries for Python and Java.

Each corrects errors in its own way, meaning you have to
reverse-engineer every one of them.

> Yes, as a feed consumer, it sucks, these might not seem like good
> reasons, and it's perfectly fair to resent that the problem is pushed
> onto you. (In my case, the path of least resistance was switching to
> Python, where others had already dealt with it. ;)

And suffer the way browser makers have for years? "This feed works in
x aggregator, but it doesn't work in your aggregator. Please fix this
bug." So then you go off and do more reverse-engineering of malformed
XML.

> In any event, it seems to me that many of the specific problems you
> cite are symptomatic of the lack of automated testing. One check with
> a validator would have caught the @content bug; unit tests would have
> proven the tag balancing issue and prevented regressions. Tests with
> an XML parser could catch the same errors as a serialiser. I'd much
> rather see a comprehensive test suite than a bunch of hideous DOM
> manipulation code.

Using SAX would allow us to work in much the same way as we already
do. Tag-balancing issues would never arise with a serialiser. You're
never going to have test suites that test everything. Something
explicitly designed to avoid these errors would prevent them from
happening. There are literally thousands of places in WP where I can
insert content that'll cause a fatal error.
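
To illustrate the SAX point, here is a rough sketch using Python's
standard xml.sax.saxutils.XMLGenerator. The BalancedWriter class and
its method names are hypothetical and only for illustration; nothing
like it exists in WP. The writer keeps a stack of open elements, so an
end tag can only ever be emitted for the element that is currently
open, and text is escaped as it is written:

```python
import io
from xml.sax.saxutils import XMLGenerator


class BalancedWriter:
    """Hypothetical SAX-style writer that cannot emit unbalanced tags."""

    def __init__(self):
        self._buffer = io.StringIO()
        self._gen = XMLGenerator(self._buffer, encoding="utf-8")
        self._open = []  # stack of element names still awaiting an end tag

    def start(self, name, attrs=None):
        self._gen.startElement(name, attrs or {})
        self._open.append(name)

    def text(self, content):
        # characters() escapes '&', '<' and '>' on the way out.
        self._gen.characters(content)

    def end(self):
        # The name comes from the stack, so it always matches the start tag.
        self._gen.endElement(self._open.pop())

    def finish(self):
        # Close whatever the caller left open, innermost first.
        while self._open:
            self.end()
        return self._buffer.getvalue()
```

A caller that forgets to close its elements still gets balanced
output: after start("entry"), start("title") and
text("Tom & Jerry <3"), finish() closes title and then entry. This
sketch does not filter characters that XML forbids; that would need a
separate step.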

- Geoffrey Sneddon