[wp-hackers] pulling a massive HTML site into Wordpress

Christopher Ross cross at thisismyurl.com
Mon Jun 6 12:01:15 UTC 2011

John, you'll need to do some programming no matter what, but the core of what you need to do is fairly straightforward.

First, identify the content that you want to insert into WP. You can do this with something as simple as a RegEx search in Dreamweaver, but at the end of the process you'll want to have 50,000 files that contain nothing except the content you want.
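If you'd rather do the extraction in PHP than in Dreamweaver, here is a rough sketch using PHP's built-in DOM extension. The function name extract_content() is made up for illustration, and it assumes the content you want sits inside a <div id="content">, which is borrowed from the example quoted below and may not match your files.

```php
<?php
// Hypothetical helper: returns the inner HTML of <div id="content">,
// or null if the file has no such element. The id is an assumption.
function extract_content(string $html): ?string
{
    if (trim($html) === '') {
        return null;
    }

    $doc = new DOMDocument();
    // Old hand-written HTML is rarely valid, so keep parse warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//div[@id="content"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Serialise each child node so the wrapper <div> itself is left out.
    $inner = '';
    foreach ($nodes->item(0)->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return trim($inner);
}
```

Check a handful of files by eye before running it over everything. loadHTML() guesses the character encoding when the page doesn't declare one, so accented characters are the first thing to look at.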

Next, write a small PHP program that scans all the files in the directory and inserts their content into WordPress. The code at http://lab.revoke.ca/2009/04/php-loop-through-files-in-folder/ loops through the files in a folder, and you can easily modify it to insert each file as a post with wp_insert_post(): http://codex.wordpress.org/Function_Reference/wp_insert_post
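Something along these lines would do the loop and the insert. Treat it as a sketch, not a finished importer: the script name import.php, the helper build_post_args(), the *.html file pattern and the choice to use the file name as the title are all assumptions on my part. It expects to be run from the WordPress root so that wp-load.php can be found.

```php
<?php
// Usage (assumed): php import.php /path/to/cleaned-files

// Hypothetical helper: turn one cleaned file into the array wp_insert_post() expects.
function build_post_args(string $path): array
{
    return array(
        // Assumption: the file name minus ".html" is good enough as a title.
        'post_title'   => basename($path, '.html'),
        'post_content' => file_get_contents($path),
        // Import as drafts so nothing goes live before you have checked it.
        'post_status'  => 'draft',
        'post_type'    => 'post',
    );
}

// Only touch WordPress when run from the command line with a directory argument.
if (PHP_SAPI === 'cli' && isset($argv[1]) && is_dir($argv[1])) {
    require 'wp-load.php'; // loads the WordPress functions without rendering a theme

    $files = glob(rtrim($argv[1], '/') . '/*.html') ?: array();
    foreach ($files as $path) {
        // Passing true makes wp_insert_post() return a WP_Error on failure.
        $post_id = wp_insert_post(build_post_args($path), true);
        if (is_wp_error($post_id)) {
            fwrite(STDERR, "Failed: $path: " . $post_id->get_error_message() . "\n");
        } else {
            echo "Imported $path as post $post_id\n";
        }
    }
}
```

Dates, authors and categories can be added to the same array (post_date, post_author, post_category) once you have pulled them out of the files. Try it on a copy of the site with a dozen files before pointing it at the full set.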

Importing the posts into WP is easy once you have the content ready to import.


On 2011-06-06, at 7:54 AM, John Black wrote:

> On 6 Jun 2011, at 14:23, Alex Andrews wrote:
>> SimpleDom isn't a program, it's a library you pull into your PHP script.
>> A potential script would do this:
>> 1. It would run on the command line.
>> 2. The first few lines would draw all the required WordPress functions
>> in without loading a site. Also include HTMLDOM
>> define('WP_USE_THEMES', false);
>> require('wp-load.php');
>> You'd also need to require the HTML DOM file.
>> 3. Then you'd parse the folders one by one using the PHP directory
>> functions http://php.net/manual/en/ref.dir.php
>> 4. You'd then get to the file itself, you'd read it in using HTML Dom
>> $html = file_get_html($filename);
>> 5. Then you'd use the library to extract the bits of the page by HTML
>> markup a bit like:
>> $content = $html->find('div[id=content]', 0)->innertext;
>> 6. Finally you'd feed this into wp_insert_post(), filling in the title
>> and so on from further HTML extractions using the library
>> 7. Loop over all your HTML files till this is done.
>> Would actually be fairly easy if they are in the same format.
>> Hope this is helpful,
>> All the best,
>> Alex
> wow.
> That is very helpful. I don't understand all of it, but it's a base to go on!
> The HTML files are mostly the same. I'm worried about these child HTML files, however. Basically, if there was an inline picture in the main HTML files, the site has used child HTML files to put a big version of the picture with a caption. Somehow those would have to be imported into the correct posts as featured images, with some sort of custom field for the caption.
> Aside from that, the rest is pretty straightforward. The site is basically analysis or news. So has author, content, byline. Would need to be categorised with dates preserved. I'm not sure that there was any kind of tags. I think I saw a plugin that can auto create tags. The only other thing is that the site content was arranged by weekly output, and this needs to be preserved, somehow, into editions or issue numbers.
> When you say command line, you're talking about Terminal, in OSX, right?
> Thanks so much, Alex.
> best,
> JB


Christopher Ross
WordPress Consulting, Design & Programming

Toronto - (416) 900-3731

What's New in WordPress? http://thisismyurl.com/new-wordpress
