[wp-hackers] Importing HTML files as pages -- been done?
mike at zed1.com
Sun Feb 8 23:53:03 GMT 2009
2009/2/8 Dougal Campbell <dougal at gunters.org>:
> No, a DOM-based approach is definitely better than regex. Regexes for
> parsing HTML can get *extremely* complicated, and if you start trying to
> write a regex-based parser from scratch, you'll almost certainly miss some
> things. There is all manner of badly formed HTML out in the wild (it's not
> all well-formed XHTML, you know),
If you are fortunate enough to only need to parse machine-generated XHTML,
it may be worth having a look at the DITA importer I just released (
I just used the plain old PHP5 DOM manipulation classes to do the work.
There are examples of finding, removing, modifying, and adding
elements using straight DOM and also XPath.
The code is far from elegant, but it might give someone a start.
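
For anyone who wants a feel for the approach before reading the importer, here is a minimal sketch of the four operations mentioned above (find, remove, modify, add) using PHP's DOMDocument and DOMXPath. It is not taken from the DITA importer: the sample markup, the "nav" class, the link rewriting rule, and the added paragraph are all made up for illustration, and it assumes the input is well-formed XHTML, since loadXML() will fail on tag soup.

```php
<?php
// Hypothetical machine-generated XHTML; not from the DITA importer.
$xhtml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample topic</title></head>
  <body>
    <div class="nav">Previous | Next</div>
    <h1>Sample topic</h1>
    <p>See <a href="other.html">the other topic</a>.</p>
  </body>
</html>
XML;

$xhtmlNs = 'http://www.w3.org/1999/xhtml';

$doc = new DOMDocument();
$doc->loadXML($xhtml); // strict XML parse: only suitable for well-formed XHTML

// XHTML elements live in a namespace, so XPath needs a prefix bound to it.
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('x', $xhtmlNs);

// Finding (plain DOM): take the page title from <title>.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;

// Removing (XPath): drop navigation blocks. The result is copied to an
// array first so that removing nodes cannot disturb the iteration.
foreach (iterator_to_array($xpath->query('//x:div[@class="nav"]')) as $nav) {
    $nav->parentNode->removeChild($nav);
}

// Modifying (XPath): rewrite links to .html files (illustrative rule only).
foreach ($xpath->query('//x:a[@href]') as $a) {
    $a->setAttribute('href', str_replace('.html', '/', $a->getAttribute('href')));
}

// Adding (plain DOM): append a new paragraph to the body.
$body = $xpath->query('//x:body')->item(0);
$note = $doc->createElementNS($xhtmlNs, 'p', 'Imported automatically.');
$body->appendChild($note);

// Serialise the children of <body> as the would-be page content.
$content = '';
foreach ($body->childNodes as $child) {
    $content .= $doc->saveXML($child);
}

echo $title, "\n", $content, "\n";
```

From here, $title and $content could be handed to something like wp_insert_post() as the post title and post content. For the badly formed HTML Dougal mentions, loadHTML() is the more forgiving entry point, though how well it copes depends on the input.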