[wp-hackers] pulling a massive HTML site into Wordpress

Mon Jun 6 10:20:56 UTC 2011

On 6 Jun 2011, at 14:03, Alex Andrews wrote:

> Sadly not, because HTML pages are in no standard format that a given
> plugin could guess. The only possibility would be extracting the raw
> text from the HTML, but that is the sort of operation you'd have to do
> in a script anyway, and you'd still need to write something to add them
> all to WordPress automatically, so you might as well cut out the
> middle man and parse the HTML directly.
> 
> Remember, even counting the time spent learning how to code a script and
> then coding one, you'll still save time compared to copying and pasting.

Definitely. Cutting and pasting 50,000 HTML files is either the work of 1000 people like me, or it's a quick path to insanity for me alone.

So I need to:

1) find a script / working method to parse the HTML files so the content I need is separated out;
2) make sure I have my WordPress post structure set up to render the content the way I need it;
3) find a script / working method to take the parsed data, however it ends up being stored, from the 50,000 HTML files and input it into 50,000 new WordPress posts, preserving dates, categories, authors, and anything else I need aside from the main content.
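
As a rough illustration of step 1 only, here is a minimal sketch in Python using just the standard library's html.parser. It assumes, purely for the sake of example, that every page carries a <title>, a <meta name="date"> tag, and a <div id="content"> wrapping the main text; the real site's markup will almost certainly differ, and the names PageExtractor and extract are made up here. It also assumes reasonably well-formed HTML: unclosed tags inside the content div would throw off the depth counter.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects the <title>, a date <meta> tag, and the text of one div."""

    # Tags that never have a closing tag, so they must not affect nesting depth.
    VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self, content_id="content"):
        super().__init__()
        self.content_id = content_id  # hypothetical id of the main content div
        self.title = ""
        self.date = None
        self.body_parts = []
        self._in_title = False
        self._depth = 0  # greater than 0 while inside the content div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "date":
            self.date = attrs.get("content")
        if self._depth:
            if tag not in self.VOID_TAGS:
                self._depth += 1
        elif tag == "div" and attrs.get("id") == self.content_id:
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        if self._depth and tag not in self.VOID_TAGS:
            self._depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._depth:
            self.body_parts.append(data)


def extract(html):
    """Return the title, date, and whitespace-normalised body text of one page."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "date": parser.date,
        "body": " ".join(" ".join(parser.body_parts).split()),
    }
```

Running extract() over each file's contents would give one small record per page, which is the "parsed data" that step 3 then has to push into WordPress. Categories and authors would need their own rules, depending on where the old site keeps them.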

I guess I have to get on top of PHP Simple HTML DOM Parser first.

Hopefully others will have further clues!

Thanks for your replies so far.

best,
JB