[wp-hackers] pulling a massive HTML site into Wordpress

Alex Andrews awgandrews at gmail.com
Mon Jun 6 11:23:44 UTC 2011

Simple HTML DOM isn't a program; it's a library you include in your own PHP script.

A potential script would do this:

1. It would run on the command line.
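2. The first few lines would pull in all the required WordPress functions
without rendering the site, by defining WP_USE_THEMES as false and then
requiring wp-load.php from the WordPress root:
define('WP_USE_THEMES', false);
require 'wp-load.php';
You'd also need to require the Simple HTML DOM file (simple_html_dom.php).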
3. Then you'd walk the folders one by one using the PHP directory
functions: http://php.net/manual/en/ref.dir.php
4. When you get to a file itself, you'd read it in using Simple HTML DOM:
$html = file_get_html($filename);
5. Then you'd use the library to extract the bits of the page by HTML
markup a bit like:
$content = $html->find('div[id=content]', 0)->innertext;
6. Finally you'd feed this into wp_insert_post(), filling in the title
and so on from further HTML extractions using the library.
7. Loop over all your HTML files till this is done.
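
The steps above might be sketched roughly as follows. Treat it as an outline under assumptions rather than a finished importer: the two paths, the helper names find_html_files() and import_html_file(), taking the post title from the <title> tag, and saving as 'draft' are all hypothetical choices for illustration. file_get_html(), find() and clear() come from Simple HTML DOM, and wp_insert_post() from WordPress.

```php
<?php
// Hypothetical paths: point these at your own MAMP WordPress install
// and at wherever you saved simple_html_dom.php.
const WP_LOAD_PATH  = '/Applications/MAMP/htdocs/wordpress/wp-load.php';
const HTML_DOM_PATH = __DIR__ . '/simple_html_dom.php';

// Step 3: collect every .html/.htm file under $root, including subfolders.
function find_html_files(string $root): array
{
    $files = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if ($file->isFile() && preg_match('/\.html?$/i', $file->getFilename())) {
            $files[] = $file->getPathname();
        }
    }
    sort($files);
    return $files;
}

// Steps 4 to 6: parse one file and insert it as a post.
// Returns the new post ID, or 0 if the file was skipped or the insert failed.
function import_html_file(string $filename): int
{
    $html = file_get_html($filename);
    if (!$html) {
        return 0;
    }
    $content = $html->find('div[id=content]', 0); // selector from step 5
    $title   = $html->find('title', 0);           // assumed source of the title
    if (!$content) {
        $html->clear();
        return 0; // page does not match the expected layout
    }
    $post_id = wp_insert_post([
        'post_title'   => $title ? trim($title->plaintext) : basename($filename),
        'post_content' => $content->innertext,
        'post_status'  => 'draft', // review before publishing
        'post_type'    => 'post',
    ]);
    $html->clear(); // free the parser's memory; matters with thousands of files
    return is_wp_error($post_id) ? 0 : (int) $post_id;
}

// Steps 1, 2 and 7: run from the command line with the archive folder as argument.
if (PHP_SAPI === 'cli' && isset($argv[1]) && is_dir($argv[1])) {
    define('WP_USE_THEMES', false);
    require WP_LOAD_PATH;
    require HTML_DOM_PATH;
    foreach (find_html_files($argv[1]) as $filename) {
        $id = import_html_file($filename);
        echo ($id ? "imported as post $id" : 'skipped') . ": $filename\n";
    }
}
```

Saved as, say, import.php (a made-up name), it would be run from a terminal as: php import.php /path/to/html/archive. Trying it on a single week's folder first would show whether the div[id=content] selector really matches your pages.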

It would actually be fairly easy if the files are all in the same format.

Hope this is helpful,

All the best,


On 6 June 2011 11:02, John Black <immanence7 at gmail.com> wrote:
> On 6 Jun 2011, at 13:23, Baki Goxhaj wrote:
>> Well, if you don't know some PHP I don't know how you are going to do it
> agreed  : /
>> 1. Use http://simplehtmldom.sourceforge.net/ to pull content and from the
>> old site and map it accordingly to WordPress fields
> Even using it is posing a problem. I'm not sure how it is used. I downloaded it, but it goes where, and how do I run it? The documentation that comes with it is clearly not written for total newbies like me.
> I have a local archive of the site I need to port. The 50,000 HTML files are sectioned out across 14 main folders (by year), each containing 50 or so subfolders (by week). I want to do the port locally, so under MAMP.
> How would the simple_html_dom.php script handle this?
> It runs from within a browser?
> (oh, I cringe too with my ignorance!)
>> 2. Use a custom script to insert posts - like the one you quoted above that
>> makes use of wp_insert_posts() function
> Forgive me, but step one would create some kind of output per HTML file (stored where?) on which you run a script to pull it into an existing Wordpress installation?
>> Good luck.
>> Kindly,
>> Baki Goxhaj
> thanks!
> _______________________________________________
> wp-hackers mailing list
> wp-hackers at lists.automattic.com
> http://lists.automattic.com/mailman/listinfo/wp-hackers
