[wp-hackers] Importing HTML files as pages -- been done?

Sun Feb 8 23:11:53 GMT 2009

"Dougal Campbell" <dougal at gunters.org> wrote:
> No, a DOM-based approach is definitely better than regex. 
> Regexes for parsing HTML can get *extremely* complicated, 
> and if you start trying to write a regex-based parser from 
> scratch, you'll almost certainly miss some things. 

I agree, in general.  In her specific case, though, she said that she'd have enclosing <div>s with unique IDs identifying the content to select. That <div> would be easy to find even with strpos(), and from there a simple loop that tracks nested <div>s to find the matching closing </div> would work.  Yes, there are potential issues with that approach (markup inside comments or scripts, unusual attribute quoting), but they would be rare.  For a general-purpose tool those limitations wouldn't be acceptable, but for a quick & dirty tool to accomplish a specific conversion it would be sufficient and easy.
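
To make the idea concrete, here is a rough sketch of that find-and-loop approach. It is written in Python rather than PHP, with str.find() standing in for strpos(). The function name extract_div and the sample markup are made up for illustration, and it assumes the id is written as id="..." with double quotes and that the tags are lowercase. It is a quick & dirty sketch under those assumptions, not a general HTML parser.

```python
def extract_div(html, div_id):
    """Return the inner HTML of the <div> with the given id, or None.

    Hypothetical helper for illustration: plain substring searches only,
    so it assumes lowercase tags and a double-quoted id attribute.
    """
    id_pos = html.find('id="%s"' % div_id)
    if id_pos == -1:
        return None

    # Back up to the "<div" that owns this id; give up if a ">" sits in
    # between, because then the id belongs to some other element.
    start = html.rfind("<div", 0, id_pos)
    if start == -1 or ">" in html[start:id_pos]:
        return None

    tag_end = html.find(">", id_pos)
    if tag_end == -1:
        return None
    content_start = tag_end + 1

    # Walk forward, counting nested <div> openings and closings until the
    # one that balances our opening tag.
    depth = 1
    pos = content_start
    while depth > 0:
        next_open = html.find("<div", pos)
        next_close = html.find("</div>", pos)
        if next_close == -1:
            return None  # unbalanced markup
        if next_open != -1 and next_open < next_close:
            depth += 1
            pos = next_open + len("<div")
        else:
            depth -= 1
            pos = next_close + len("</div>")

    return html[content_start:pos - len("</div>")]


if __name__ == "__main__":
    page = ('<div id="nav">menu</div>'
            '<div id="content"><p>Hi</p><div class="box">inner</div></div>')
    print(extract_div(page, "content"))
```

The depth counter is what handles a <div> nested inside the one being extracted. The rare failure cases are things like a literal "</div>" inside a comment or script, which this sketch does not try to detect.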

Still, for those who feel that only a DOM approach will do, I won't stand in the way by debating it further. :-)

-Mike Schinkel
http://mikeschinkel.com/