[wp-hackers] Importing HTML files as pages -- been done?
dougal at gunters.org
Mon Feb 9 00:07:29 GMT 2009
Mike Schinkel wrote:
> "Dougal Campbell" <dougal at gunters.org> wrote:
>> No, a DOM-based approach is definitely better than regex.
>> Regexes for parsing HTML can get *extremely* complicated,
>> and if you start trying to write a regex-based parser from
>> scratch, you'll almost certainly miss some things.
> I agree, in general. In her specific case she said that she'd have
> enclosing <div>s with unique IDs identifying the content to select.
> That <div> would be easy to find even with strpos(), and from there a
> simple loop to find the applicable closing </div> would work. Yes,
> there are potential issues with that approach, but they would be rare.
> For a general-purpose tool those limitations wouldn't be acceptable,
> but for a quick & dirty tool to accomplish a specific conversion it
> would be sufficient and easy.
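
For illustration only, the strpos() idea above might look roughly like the sketch below. The function name extract_div_inner() is hypothetical, and the sketch assumes the opening tag is written exactly as <div id="..."> (double quotes, id as the first attribute). It also counts any tag that merely starts with "<div" and does not skip comments or scripts, which are the kinds of rare issues mentioned above.

```php
<?php
// Hypothetical helper, not taken from any posted code: returns the inner HTML
// of the <div> whose id is $id, or null if it is missing or never closed.
function extract_div_inner($html, $id) {
    // Assumption: the opening tag looks exactly like <div id="the-id" ...>
    $open = strpos($html, '<div id="' . $id . '"');
    if ($open === false) {
        return null;
    }
    $start = strpos($html, '>', $open);
    if ($start === false) {
        return null;
    }
    $start++; // first character after the opening tag

    // Walk forward, counting nested <div> openings and closings,
    // until the depth returns to zero.
    $depth = 1;
    $pos = $start;
    while ($depth > 0) {
        $nextOpen = stripos($html, '<div', $pos);
        $nextClose = stripos($html, '</div', $pos);
        if ($nextClose === false) {
            return null; // unbalanced markup: no closing tag left
        }
        if ($nextOpen !== false && $nextOpen < $nextClose) {
            $depth++;
            $pos = $nextOpen + 4;  // skip past "<div"
        } else {
            $depth--;
            $pos = $nextClose + 5; // skip past "</div"
        }
    }
    // $pos now sits just past "</div" of the matching close.
    return substr($html, $start, $pos - 5 - $start);
}
```

Something like extract_div_inner($page, 'content') would then hand back everything between the matching tags, nested <div>s included, as long as the markup stays within those assumptions.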
Mike Little wrote:
> If you have the fortune to only need to parse machine generated XHTML,
> it may be worth having a look at my DITA importer I just released (
> I just used the plain old PHP5 DOM manipulation classes to do the work.
> There are examples of finding, removing, modifying, and adding in
> elements using straight DOM and also XPath.
> The code is far from elegant, but it might give someone a start.
True, if you know for sure that you have well-formed input, then it
greatly simplifies things. But it seemed that the discussion was
wandering towards the idea of a more generalized tool that could deal
with arbitrary sets of files. And since I've been dealing with such a
case for some time now, it was sort of weighing down my brain in that
direction anyways. :)
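
For comparison, here is a minimal sketch of the DOM-based route using the stock PHP5 DOMDocument and DOMXPath classes. This is not the code from the DITA importer; the function name extract_div_inner_dom() is made up for the example. It assumes a PHP version where saveHTML() accepts a node argument (5.3.6 or later), that $id contains no double quotes, and that the input either declares its charset or is fine with the parser's default.

```php
<?php
// Hypothetical helper: returns the inner HTML of the <div> with the given id,
// or null if no such element exists. Uses libxml's forgiving HTML parser,
// so unclosed tags and similar tag soup are repaired rather than rejected.
function extract_div_inner_dom($html, $id) {
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Assumption: $id contains no double quotes, so it is safe to inline.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//div[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Serialize each child of the matched <div> to get its inner HTML.
    $inner = '';
    foreach ($nodes->item(0)->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}
```

The parser takes care of nesting, attribute order, and quoting style, so none of the string-matching assumptions apply here. The trade-off is that the returned markup is the parser's cleaned-up version, not a byte-for-byte copy of the input.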
Dougal Campbell <dougal at gunters.org>