[wp-hackers] Importing HTML files as pages -- been done?

Sun Feb 8 22:12:07 GMT 2009

No, a DOM-based approach is definitely better than regex. Regexes for 
parsing HTML can get *extremely* complicated, and if you start trying to 
write a regex-based parser from scratch, you'll almost certainly miss 
some things. There is all manner of badly formed HTML out in the wild 
(it's not all well-formed XHTML, you know), and you have to consider how 
to handle it. For example, you might have to parse something like:

  <p><img src="foo.mpg" width=512 title='image of a </p> tag' alt="it's fun to parse html!" /></p>

Note that one attribute is single-quoted, others are double-quoted, and 
one has no quotes at all (valid in HTML4). There's also a single quote 
inside a double-quoted attribute, and a close-p tag inside an attribute 
value, which would throw off a lot of regex-based parsing attempts.
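
To make that concrete, here is a rough sketch of how a naive pattern and 
a real HTML parser each see that snippet. It is written in Python with 
the standard library's html.parser purely for illustration; the class 
name ImgAttrs and the pattern are made up for this example, and a PHP 
version would more likely go through something like 
DOMDocument::loadHTML().

```python
import re
from html.parser import HTMLParser

# The example snippet from above, joined onto one logical line.
SNIPPET = ('<p><img src="foo.mpg" width=512 '
           "title='image of a </p> tag' "
           'alt="it\'s fun to parse html!" /></p>')

# Naive regex: grab whatever sits between <p> and the first </p>.
# It stops at the </p> inside the title attribute.
naive = re.search(r"<p>(.*?)</p>", SNIPPET).group(1)


class ImgAttrs(HTMLParser):
    """Hypothetical helper: collect the attributes of each <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            self.images.append(dict(attrs))


parser = ImgAttrs()
parser.feed(SNIPPET)
parser.close()
attrs = parser.images[0]

print("regex saw: ", naive)
print("parser saw:", attrs)
```

The regex match is cut off in the middle of the title attribute, while 
the parser reports all four attributes with their quoting already 
resolved.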

Then you have the fun of deciding how you can extract the main content 
out of some arbitrary set of pages in a generic enough way to make your 
code work with different sets of pages. I've done work in this area 
before in both Perl and PHP, and you *definitely* want to use a 
DOM-based approach. There are libraries available to do the heavy lifting.
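
As a very rough sketch of the extraction step, the following pulls the 
text out of one container element and ignores navigation and footer 
markup around it. It assumes, purely for illustration, that the pages 
wrap their main content in an element with id="content"; the class name, 
the function name, and the sample page are all hypothetical. It is also 
only a stand-in for a real DOM library: it counts open and close tags 
itself, so unlike a proper DOM parser it does not repair unclosed tags.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ContentExtractor(HTMLParser):
    """Collect text found inside the element whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # 0 means "outside the target element"
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag not in VOID:
                self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_text(html, target_id="content"):
    """Return the text inside the target element (assumed id)."""
    extractor = ContentExtractor(target_id)
    extractor.feed(html)
    extractor.close()
    return " ".join(p.strip() for p in extractor.parts if p.strip())


# Made-up sample page, not taken from any real site.
PAGE = ('<html><body><div id="nav"><a href="/">Home</a></div>'
        '<div id="content"><h1>Title</h1><p>Body <b>text</b><br>More</p></div>'
        '<div id="footer">bye</div></body></html>')

print(extract_text(PAGE))
```

In practice the hard part is choosing the selector: different sets of 
pages will mark their main content differently, so the target would need 
to be configurable per site.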

Try these bookmarks of mine for some possible starting points:

  http://delicious.com/dougal/screenscraping

-- 
Dougal Campbell <dougal at gunters.org>
http://dougal.gunters.org/
http://twitter.com/dougal
*Hire me!*

Mike Schinkel wrote:
> I think you'll be better off using preg_match(). That's what most screen-scrapers use...
>
> -Mike Schinkel
> http://mikeschinkel.com/
>
> ----- Original Message -----
> From: "Stephanie Leary" <steph at sillybean.net>
> To: wp-hackers at lists.automattic.com
> Sent: Friday, February 6, 2009 12:52:29 PM GMT -05:00 US/Canada Eastern
> Subject: Re: [wp-hackers] Importing HTML files as pages -- been done?
>
> On Fri, 6 Feb 2009 09:50:17 -0800, Dan Gayle <dangayle at gmail.com> wrote:
>   
>> Isn't that something that you could use AJAX for? That's a simple
>> jQuery thing to do using innerHtml
>>     
>
> I was thinking more along the lines of parsing the files as XML, but I'm
> open to suggestions.
>