[wp-hackers] pulling a massive HTML site into Wordpress
immanence7 at gmail.com
Mon Jun 6 10:54:42 UTC 2011
On 6 Jun 2011, at 14:23, Alex Andrews wrote:
> SimpleDom isn't a program, it's a library you pull into your PHP script.
> A potential script would do this:
> 1. It would run on the command line.
> 2. The first few lines would pull in all the required WordPress functions
> without rendering the site:
> define('WP_USE_THEMES', false);
> You'd also need to require the HTML DOM library file.
> 3. Then you'd parse the folders one by one using the PHP directory
> functions http://php.net/manual/en/ref.dir.php
> 4. When you get to a file itself, you'd read it in using HTML DOM:
> $html = file_get_html($filename);
> 5. Then you'd use the library to extract the bits of the page by HTML
> markup a bit like:
> $content = $html->find('div[id=content]', 0)->innertext;
> 6. Finally you'd feed this into wp_insert_post(), filling in the title
> and so on from further HTML extractions using the library.
> 7. Loop over all your HTML files till this is done.
> Would actually be fairly easy if they are in the same format.
> Hope this is helpful,
> All the best,
That is very helpful. I don't understand all of it, but it's a base to go on!
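For reference, here is a rough sketch of how the quoted steps might fit together. It is only a sketch under assumptions: it swaps Simple HTML DOM for PHP's bundled DOM extension so it runs without extra files, it takes the post title from the <title> tag (a guess about the markup), and the function names extract_page() and find_html_files() are made up for illustration. It also assumes WordPress has already been loaded (for example by requiring wp-load.php from the WordPress root); without that, the import loop at the bottom simply does nothing.

```php
<?php
// Sketch only: DOMDocument stands in for Simple HTML DOM here.

/**
 * Pull the title and body out of one page.
 * The "content" id comes from the quoted example; reading the title
 * from <title> is an assumption about the old site's markup.
 */
function extract_page(string $html): array
{
    $doc = new DOMDocument();
    // Old hand-written HTML is rarely valid, so silence parser warnings.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath       = new DOMXPath($doc);
    $titleNode   = $xpath->query('//title')->item(0);
    $contentNode = $xpath->query('//div[@id="content"]')->item(0);

    $content = '';
    if ($contentNode !== null) {
        // Rough equivalent of ->innertext: serialise each child node.
        foreach ($contentNode->childNodes as $child) {
            $content .= $doc->saveHTML($child);
        }
    }

    return [
        'post_title'   => $titleNode !== null ? trim($titleNode->textContent) : '',
        'post_content' => trim($content),
        'post_status'  => 'publish',
    ];
}

/** Walk a folder tree and return the path of every .html/.htm file. */
function find_html_files(string $root): array
{
    $files    = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if (preg_match('/\.html?$/i', $file->getFilename())) {
            $files[] = $file->getPathname();
        }
    }
    sort($files);
    return $files;
}

// Import only when run from the command line with a folder argument
// and with WordPress already loaded.
if (PHP_SAPI === 'cli' && isset($argv[1]) && function_exists('wp_insert_post')) {
    foreach (find_html_files($argv[1]) as $path) {
        $html = file_get_contents($path);
        if ($html === false || $html === '') {
            continue; // unreadable or empty file, skip it
        }
        wp_insert_post(extract_page($html));
    }
}
```

Something like `php import.php /path/to/old-site` would start it, where import.php is just a placeholder name for the script.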
The HTML files are mostly the same. I'm worried about the child HTML files, however. Wherever there was an inline picture in a main HTML file, the site used a child HTML file to show a big version of the picture with a caption. Somehow those would have to be imported into the correct posts as featured images, with some sort of custom field for the caption.
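A possible shape for the child-page handling, again only as a sketch. The markup is assumed, not known: it treats the first <img> on a child page as the big picture and looks for the caption in an element with class "caption". The function names and the 'image_caption' custom field key are hypothetical. The WordPress calls it relies on, set_post_thumbnail() and update_post_meta(), are real, but the picture would first have to be imported into the media library to get an attachment ID, which is not shown here.

```php
<?php
/**
 * Read the large image path and its caption from a child page.
 * Assumed markup: first <img> is the big picture, caption sits in an
 * element whose class list contains "caption". Returns null if no image.
 */
function extract_child_image(string $html): ?array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate invalid legacy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $img   = $xpath->query('//img')->item(0);
    if (!$img instanceof DOMElement) {
        return null;
    }

    // Match "caption" as a whole class name, not as a substring.
    $captionNode = $xpath->query(
        '//*[contains(concat(" ", normalize-space(@class), " "), " caption ")]'
    )->item(0);

    return [
        'src'     => $img->getAttribute('src'),
        'caption' => $captionNode !== null ? trim($captionNode->textContent) : '',
    ];
}

/**
 * Make an already-imported attachment the post's featured image and
 * store the caption in a custom field. Needs WordPress to be loaded.
 */
function attach_featured_image(int $postId, int $attachmentId, string $caption): void
{
    set_post_thumbnail($postId, $attachmentId);
    update_post_meta($postId, 'image_caption', $caption); // hypothetical key
}
```

Matching each child page to the right post would still depend on how the main pages link to them, which this does not try to solve.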
Aside from that, the rest is pretty straightforward. The site is basically analysis or news, so each piece has an author, content and a byline. Posts would need to be categorised, with their dates preserved. I don't think there were any tags, but I think I saw a plugin that can auto-create them. The only other thing is that the site content was arranged by weekly output, and this needs to be preserved somehow, as editions or issue numbers.
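One way the dates and weekly editions might be carried over, as a sketch under assumptions: the edition is labelled by ISO year and week number (the old site may have numbered its issues differently), the 'edition' custom field key and all function names are invented for illustration, and the author and category IDs are assumed to exist in WordPress already. 'post_date', 'post_author' and 'post_category' are real wp_insert_post() arguments.

```php
<?php
/**
 * Label a weekly edition by ISO-8601 year and week, e.g. "2011-W23".
 * "o" is the ISO week-numbering year, which keeps the first days of
 * January in the right edition when they belong to the previous year's
 * last week.
 */
function edition_label(string $date): string
{
    $d = new DateTimeImmutable($date, new DateTimeZone('UTC'));
    return $d->format('o-\WW');
}

/** Add the original date, author and categories to the post arguments. */
function build_post_args(array $page, string $date, int $authorId, array $categoryIds): array
{
    $d = new DateTimeImmutable($date, new DateTimeZone('UTC'));
    return array_merge($page, [
        'post_status'   => 'publish',
        'post_date'     => $d->format('Y-m-d H:i:s'), // original publication date
        'post_author'   => $authorId,
        'post_category' => $categoryIds,              // array of category IDs
    ]);
}

/** Insert the post and record its edition. Needs WordPress to be loaded. */
function import_article(array $page, string $date, int $authorId, array $categoryIds): int
{
    $postId = wp_insert_post(build_post_args($page, $date, $authorId, $categoryIds));
    if (is_wp_error($postId) || $postId === 0) {
        return 0; // insert failed
    }
    add_post_meta($postId, 'edition', edition_label($date)); // hypothetical key
    return $postId;
}
```

A custom taxonomy would be another option for editions; a custom field is just the simplest thing to show.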
When you say command line, you're talking about Terminal, in OSX, right?
Thanks so much, Alex.