[wp-hackers] XHTML Strict Mode

Mark Jaquith mark.wordpress at txfx.net
Thu Aug 12 11:09:45 UTC 2004


Funny... not 2 hours ago I was looking around for similar code for 
another project, and I found something that might be quite useful here.
http://simon.incutio.com/code/php/SafeHtmlChecker.class.php.txt

It handles a lot of different cases, such as a block element used inside 
an inline element, as well as improper attributes.  It was originally 
designed to parse comments (with the obvious security restrictions), but 
it could be modified to do just about anything.
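
To make the two checks mentioned above concrete (a block element inside an inline element, and improper attributes), here is a minimal Python sketch. It is not taken from SafeHtmlChecker, which is PHP. The tag sets, the ALLOWED_ATTRS table, and the names NestingChecker and check are illustrative assumptions, and the code only reports problems rather than repairing them.

```python
from html.parser import HTMLParser

# Illustrative subsets only; these are NOT the rule tables used by SafeHtmlChecker.
INLINE = {"a", "b", "em", "i", "span", "strong"}
BLOCK = {"blockquote", "div", "ol", "p", "ul"}
ALLOWED_ATTRS = {"a": {"href", "title"}, "blockquote": {"cite"}}


class NestingChecker(HTMLParser):
    """Collects problems instead of repairing the markup."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags that are currently open
        self.errors = []  # human-readable problem descriptions

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names already lowercased.
        if tag in BLOCK:
            for parent in reversed(self.stack):
                if parent in INLINE:
                    self.errors.append(f"block <{tag}> inside inline <{parent}>")
                    break
        for name, _value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()):
                self.errors.append(f"attribute '{name}' not allowed on <{tag}>")
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass


def check(markup):
    """Return a list of problems found in the markup (empty if none)."""
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.errors
```

For example, check('<b><p onclick="x()">hi</p></b>') reports both the misplaced <p> and the onclick attribute, while check('<p><a href="/x">ok</a></p>') returns an empty list.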

Jamie Talbot wrote:

>Ok,
>
>I've got a preliminary solution, which does a better job of closing tags based
>on relationships derived from the XHTML 1.0 Strict DTD.  It's pretty dirty at
>the moment, full of comments, echo statements and debug info!  It parsed an 8k
>string in about 0.2 seconds though, so I guess time-wise it won't be a problem.
>
>You can find the file at:
>
>http://www.jamietalbot.com/wp-hacks/xvalid.phps
>
>At the moment, there are no warnings - it just goes plowing on, reordering
>things as is deemed necessary  :D  This will obviously have to change.
>Probably a set of options on just how automatic the process should be?
>
>One of the more debatable things it does is remove tags entirely that aren't in
>the specification.  Obviously this should not be a default setting, as a typo
>could cause a tag pair to be lost.  This would be a good candidate for simply
>warning the user.
>
>The file contains a lot of test cases.  Best way to test it is on a local
>server, uncommenting tests one by one to see how it deals with each.  Looking
>at the HTML source at the end is the best way I've found.
>
>The things it should do properly:
>
>Never put a tag inside a tag that it can't legally be inside.  If there are
>errors in this, the fix is simple as it only involves editing the top arrays.
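
As an illustration of driving the repair from containment arrays, here is a rough Python sketch. It is not the code in xvalid.phps. ALLOWED_CHILDREN is a hypothetical, heavily trimmed table (the real relationships come from the DTD), and close_late is an invented name. An open tag is closed only when a new tag cannot legally go inside it, when its own closing tag arrives, or at the end of the input. Tags missing from the table are dropped, so empty tags such as <br /> would need their own entries.

```python
import re

# Hypothetical, heavily trimmed containment table: parent -> tags it may contain.
# "#root" stands for the top of the fragment. The real rules come from the DTD.
ALLOWED_CHILDREN = {
    "#root": {"blockquote", "p", "ul"},
    "blockquote": {"p", "ul"},
    "p": {"a", "b", "i"},
    "ul": {"li"},
    "li": {"a", "b", "i", "p", "ul"},
    "a": {"b", "i"},
    "b": {"a", "i"},
    "i": {"a", "b"},
}
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")


def close_late(markup):
    out, stack, pos = [], ["#root"], 0
    for m in TAG.finditer(markup):
        out.append(markup[pos:m.start()])  # text between tags is copied as-is
        pos = m.end()
        closing, name, rest = m.group(1), m.group(2).lower(), m.group(3)
        if closing:
            if name in stack:
                # Close everything still open inside this element, then the element.
                while True:
                    top = stack.pop()
                    out.append(f"</{top}>")
                    if top == name:
                        break
            continue  # a closing tag with no open partner is simply dropped
        # Find the innermost open element that may legally contain this tag.
        depth = next((i for i in range(len(stack) - 1, -1, -1)
                      if name in ALLOWED_CHILDREN.get(stack[i], ())), None)
        if depth is None:
            continue  # nothing open may contain it (or it is unknown): drop it
        while len(stack) > depth + 1:
            out.append(f"</{stack.pop()}>")  # forced close of inner elements
        stack.append(name)
        out.append(f"<{name}{rest}>")
    out.append(markup[pos:])
    while len(stack) > 1:
        out.append(f"</{stack.pop()}>")  # close whatever is still open at the end
    return "".join(out)
```

With this table, close_late('<p>one<p>two') gives '<p>one</p><p>two</p>'. Fixing a wrong nesting decision only means editing ALLOWED_CHILDREN, not the loop.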
>
>Correct unclosed empty tags, e.g. <br> or </br> becomes <br />.
>Remove orphan tags (closing tags that were never opened).
>Deal with greater-than signs.
>Lowercase all the tags it encounters (but not the attributes).
>Attempt to merge nested duplicate tags,
>e.g. <b><i>help</i><b>me</b></b> becomes <b><i>help</i>me</b>.
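
Several of the clean-ups listed above can be sketched in a single pass. This is a simplified Python illustration, not the approach in xvalid.phps. EMPTY_TAGS and MERGEABLE are assumed subsets, tidy is an invented name, and the regular expression does not cope with a ">" inside an attribute value. It does not fix misnested tags.

```python
import re

EMPTY_TAGS = {"br", "hr", "img", "input"}  # illustrative subset of empty elements
MERGEABLE = {"b", "em", "i", "strong"}     # assumption: safe to collapse when nested
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")


def tidy(markup):
    out, open_tags, skipped, pos = [], [], {}, 0
    for m in TAG.finditer(markup):
        out.append(markup[pos:m.start()])  # copy text between tags unchanged
        pos = m.end()
        # Only the tag name is lowercased; attributes are kept as written.
        closing, name, rest = m.group(1), m.group(2).lower(), m.group(3)
        if name in EMPTY_TAGS:
            # <br>, </br> and <br/> all become <br />.
            attrs = rest.rstrip().rstrip("/").rstrip()
            out.append(f"<{name}{attrs} />")
        elif not closing:
            if name in MERGEABLE and name in open_tags:
                # Nested duplicate: drop it and remember to drop its closing tag.
                skipped[name] = skipped.get(name, 0) + 1
            else:
                open_tags.append(name)
                out.append(f"<{name}{rest}>")
        elif skipped.get(name):
            skipped[name] -= 1  # this closes a dropped duplicate, so drop it too
        elif name in open_tags:
            # Remove the most recent matching open tag and emit the closing tag.
            idx = len(open_tags) - 1 - open_tags[::-1].index(name)
            del open_tags[idx]
            out.append(f"</{name}>")
        # Any other closing tag is an orphan and is dropped.
    out.append(markup[pos:])
    return "".join(out)
```

For instance, tidy('<b><i>help</i><b>me</b></b>') returns '<b><i>help</i>me</b>', and tidy('<P>one<BR>two</br></P></I>') returns '<p>one<br />two<br /></p>'.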
>
>The things it doesn't do yet:
>
>Remove tag pairs that have no contents, e.g. <p></p>.
>Deal with attributes at all.  That would be a nightmare!
>
>The things it will (hopefully!) do later:
>
>Automatically surround non-list elements inside <ul>, <ol> with <li> tags.
>Deal with tags that are self-closed by mistake, e.g. <a href="..." />
>
>My tests are by no means exhaustive.  Again, comments and bug reports
>appreciated!  I expect there will be a fair few of the latter :D
>
>Jamie.
>
>--
>http://www.jamietalbot.com/
>
>
>Quoting Brian Meidell <brian at mindflow.dk>:
>
>>Jamie Talbot wrote:
>>
>>>Basically, closing each tag as late as is legally possible.
>>
>>I would go for the same solution.
>>
>>If markup doesn't just pass with flying colors, it might be a good idea
>>to ask the author to confirm the result before publishing it.
>>
>>/Brian
>>
>>_______________________________________________
>>hackers mailing list
>>hackers at wordpress.org
>>http://wordpress.org/mailman/listinfo/hackers_wordpress.org
>>