[wp-hackers] Importing HTML files as pages -- been done?
Dougal Campbell
dougal at gunters.org
Sun Feb 8 22:12:07 GMT 2009
No, a DOM-based approach is definitely better than regex. Regexes for
parsing HTML can get *extremely* complicated, and if you start trying to
write a regex-based parser from scratch, you'll almost certainly miss
some things. There is all manner of badly formed HTML out in the wild
(it's not all well-formed XHTML, you know), and you have to consider how
to handle it. For example, you might have to parse something like:
<p><img src="foo.mpg" width=512 title='image of a </p> tag' alt="it's
fun to parse html!" /></p>
Note that one attribute is in single quotes, others are in double
quotes, and one has no quotes at all (valid in HTML4). There is a
single quote inside a double-quoted attribute, and a close-p tag
inside an attribute value, which would throw off a lot of regex-based
parsing attempts.
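To make that concrete, here is a rough sketch in Python (not PHP, just because its standard library ships a tolerant HTML parser) that runs the snippet above through a real parser and through a naive regex. The names ImgAttrs, parse_img_attrs and naive_paragraph are made up for illustration, and the line-wrapped alt text is joined with a single space.

```python
import re
from html.parser import HTMLParser

# The example markup from above, with the wrapped alt text joined by a space.
HTML = ("<p><img src=\"foo.mpg\" width=512 title='image of a </p> tag' "
        "alt=\"it's fun to parse html!\" /></p>")


class ImgAttrs(HTMLParser):
    """Hypothetical helper: record the attributes of every <img> tag seen."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            self.found.append(dict(attrs))


def parse_img_attrs(html):
    """Return a list of attribute dicts, one per <img> tag."""
    parser = ImgAttrs()
    parser.feed(html)
    parser.close()
    return parser.found


def naive_paragraph(html):
    """The kind of regex people reach for first: grab what is inside <p>...</p>."""
    match = re.search(r"<p>(.*?)</p>", html, re.S)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(parse_img_attrs(HTML))   # all four attributes, quoting handled
    print(naive_paragraph(HTML))   # cut short at the </p> inside title
```

The parser returns all four attributes intact, while the regex stops at the close-p tag inside the title attribute and never sees the alt text.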
Then you have the fun of deciding how you can extract the main content
out of some arbitrary set of pages in a generic enough way to make your
code work with different sets of pages. I've done work in this area
before in both Perl and PHP, and you *definitely* want to use a
DOM-based approach. There are libraries available to do the heavy lifting.
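As one possible shape for the "generic enough" part, the sketch below (Python, standard library only) pulls the text out of whichever element carries a given id, so the id becomes the only per-site setting. The id "content" and the names ContentExtractor and extract_main_text are assumptions for illustration, not anything from a particular library. It counts only nested tags with the same name as the container, so unclosed <p> tags inside it do not throw the count off.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Hypothetical helper: collect text inside the first element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.container_tag = None  # tag name of the matched element, once found
        self.depth = 0             # nesting depth of same-named tags inside it
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.container_tag is None:
            if dict(attrs).get("id") == self.target_id:
                self.container_tag = tag
                self.depth = 1
        elif tag == self.container_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.done or self.container_tag is None:
            return
        if tag == self.container_tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.container_tag is not None and not self.done:
            self.chunks.append(data)


def extract_main_text(html, target_id="content"):
    """Return whitespace-normalized text of the element with id=target_id."""
    parser = ContentExtractor(target_id)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    page = ('<div id="nav"><a href="/">Home</a></div>'
            '<div id="content"><h1>Title</h1><p>First<p>Second</div>'
            '<div id="footer">Copyright</div>')
    print(extract_main_text(page))
```

A real importer would more likely keep the inner markup than flatten it to text, and would need a fallback for pages that lack the expected id; this only shows the idea of making the container selector the configurable piece.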
Try these bookmarks of mine for some possible starting points:
http://delicious.com/dougal/screenscraping
--
Dougal Campbell <dougal at gunters.org>
http://dougal.gunters.org/
http://twitter.com/dougal
*Hire me!*
Mike Schinkel wrote:
> I think you'll be better off using preg_match(). That's what most screen-scrapers use...
>
> -Mike Schinkel
> http://mikeschinkel.com/
>
> ----- Original Message -----
> From: "Stephanie Leary" <steph at sillybean.net>
> To: wp-hackers at lists.automattic.com
> Sent: Friday, February 6, 2009 12:52:29 PM GMT -05:00 US/Canada Eastern
> Subject: Re: [wp-hackers] Importing HTML files as pages -- been done?
>
> On Fri, 6 Feb 2009 09:50:17 -0800, Dan Gayle <dangayle at gmail.com> wrote:
>
>> Isn't that something that you could use AJAX for? That's a simple
>> jQuery thing to do using innerHtml
>>
>
> I was thinking more along the lines of parsing the files as XML, but I'm
> open to suggestions.
>