[wp-hackers] pulling a massive HTML site into Wordpress

John Black immanence7 at gmail.com
Mon Jun 6 09:15:36 UTC 2011


Hi there,

I have a gargantuan HTML-based site I want to port to WordPress. I'm talking 52,000 individual HTML pages, plus a further 10,000 pages with minimal content (mostly pictures with captions) that are basically child pages of those main pages.

In addition to the roughly 62,000 HTML pages in total, the site has around 52,000 images embedded across these pages.

Happily, there is some regularity across these pages, and it appears (from a quick look) that some elements (e.g., names of the authors, and extracts) could be ripped out along with the main content. Not all the archive is regularised, however. Site content goes back to 1998.

First, I know I'm going to need a solid strategy here. I kind of need help with that, for although I face this massive task, I'm actually a graphic designer, not a programmer.

I know I will need to design a WordPress page or post structure to ultimately contain the various parts of these pages I want to rip.

But most of all, I need to know how to start.

1. I looked at the discussion held here on wp-hackers in February on "Porting static content". It gives a few leads, but I have questions.

2. I looked at the plugin Import HTML Pages. I need to look closer, but a first attempt on a limited number of files failed. I'm working on a localhost MAMP setup.

3. The consensus from the earlier February discussion was that PHP Simple HTML DOM Parser (http://sourceforge.net/projects/simplehtmldom/) was a great tool. I looked at it, but have no idea how to even start using it. I can't find documentation that speaks to someone as code illiterate as me. Can it be used on a localhost MAMP installation?

4. Someone else mentioned phpQuery (http://code.google.com/p/phpquery/). I haven't the faintest idea how to use that either.
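
On point 3, one possible starting place that needs no extra download is PHP's bundled DOM extension (DOMDocument and DOMXPath). This is a different tool from Simple HTML DOM or phpQuery, but it does the same kind of job, and since it is normally enabled by default in PHP it should also run on a local install. A minimal sketch, assuming each page has an h1 title, an element whose class is exactly "author", and a div with id "content" (all three are guesses about the markup, and extract_page_parts is a made-up helper name):

```php
<?php
// Hypothetical helper: pulls title, author and body out of one HTML page.
// The selectors (h1, class="author", id="content") are assumptions about
// the markup and would need adjusting to the real pages.
function extract_page_parts(string $html): array
{
    $doc = new DOMDocument();
    // Old hand-written HTML is rarely valid, so collect parser warnings
    // internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Returns the trimmed text of the first node matching the query, or ''.
    $first_text = function (string $query) use ($xpath): string {
        $nodes = $xpath->query($query);
        return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : '';
    };

    // Keep the inner HTML of the content container so images and links survive.
    $content = '';
    $containers = $xpath->query('//div[@id="content"]');
    if ($containers->length > 0) {
        foreach ($containers->item(0)->childNodes as $child) {
            $content .= $doc->saveHTML($child);
        }
    }

    return array(
        'title'   => $first_text('//h1'),
        'author'  => $first_text('//*[@class="author"]'), // exact class match only
        'content' => trim($content),
    );
}

$sample = '<html><body><h1>Spring Issue</h1>'
        . '<p class="author">Jane Doe</p>'
        . '<div id="content"><p>Hello <b>world</b>.</p></div></body></html>';

$parts = extract_page_parts($sample);
echo $parts['title'], ' / ', $parts['author'], "\n";
```

Pages from the less regular part of the archive would simply come back with empty strings for the parts that are missing, which makes them easy to set aside for manual handling.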

Basically, I need some kind of heads up here.

Can someone give a kind of overview of what is possible? Does anyone have any scripts they could share with me that might streamline this and save me from an impossible learning curve?

I include below some excerpts of the discussion in February. If anyone can elaborate further on some of that, I'd love to read more!


Thanks so much for reading, and sorry for not understanding code. My head works in another way.

best,
John Black

______________________


Bill Dennen dennen at gmail.com 
Wed Feb 23 01:29:35 UTC 2011

>> I've used http://sourceforge.net/projects/simplehtmldom/ a number of times.

I've had good luck with that, and phpQuery.

http://code.google.com/p/phpquery/
phpQuery is a server-side, chainable, CSS3 selector driven Document
Object Model (DOM) API based on jQuery JavaScript Library

In our case, we first build a sitemap of the pages we want to
import. It's basically a hierarchy built as a multi-level, unordered
list. This allows us to maintain parent-child relationships between
pages when they are imported into WordPress. We loop over that
unordered list of links, scrape each page, and use phpQuery to select
different parts of the page based on jQuery selectors. We can also add
custom fields to these imported pages using add_post_meta().

-Bill
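
The sitemap idea above can be sketched roughly as follows. This is not Bill's script: it uses PHP's built-in DOMDocument rather than phpQuery, and the function name flatten_sitemap, the sample URLs, and the returned array keys are all made up for illustration. It assumes the sitemap is a nested ul in which each li holds one link and, optionally, a child ul.

```php
<?php
// Hypothetical helper: walks a nested <ul> sitemap and returns a flat list
// of pages, each with the URL of its parent (null for top-level pages).
function flatten_sitemap(DOMElement $ul, ?string $parent = null): array
{
    $pages = array();
    foreach ($ul->childNodes as $li) {
        if (!($li instanceof DOMElement) || $li->nodeName !== 'li') {
            continue; // skip whitespace text nodes and stray elements
        }
        $url = null;
        $child_lists = array();
        foreach ($li->childNodes as $node) {
            if (!($node instanceof DOMElement)) {
                continue;
            }
            if ($node->nodeName === 'a' && $url === null) {
                // The first link in the <li> is the page itself.
                $url = $node->getAttribute('href');
                $pages[] = array(
                    'url'    => $url,
                    'title'  => trim($node->textContent),
                    'parent' => $parent,
                );
            } elseif ($node->nodeName === 'ul') {
                $child_lists[] = $node;
            }
        }
        // Children are listed after their parent.
        foreach ($child_lists as $child_ul) {
            $pages = array_merge($pages, flatten_sitemap($child_ul, $url));
        }
    }
    return $pages;
}

$sitemap = '<ul>'
         . '<li><a href="articles/1998.html">1998</a>'
         . '<ul><li><a href="articles/1998/gallery.html">Gallery</a></li></ul></li>'
         . '<li><a href="about.html">About</a></li>'
         . '</ul>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($sitemap);
libxml_clear_errors();

$pages = flatten_sitemap($doc->getElementsByTagName('ul')->item(0));
foreach ($pages as $page) {
    echo $page['url'], ' <- ', $page['parent'] ?? '(top level)', "\n";
}
```

Because parents come before their children in the returned list, an import loop can insert the parent first, remember the ID that wp_insert_post() returns, and pass it as 'post_parent' when inserting the children.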

______________________


Christopher Ross cross at thisismyurl.com 
Tue Feb 22 22:55:22 UTC 2011

Scot, this sounds like a perfect use of the wp_insert_post() function


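	// $authorID, $cat, $blogpoststatus and $posttitle are assumed to be
	// set earlier in the script.
	$post = array(
	  'comment_status' => 'closed',
	  'ping_status' => 'closed',
	  'post_author' => $authorID,
	  'post_category' => $cat, // array of category IDs
	  // wp_insert_post() expects slashed input (which $_POST already is
	  // inside WordPress) and handles database escaping itself, so the
	  // values are passed through without mysql_real_escape_string().
	  'post_content' => $_POST['post_content'],
	  'post_excerpt' => $_POST['post_excerpt'],
	  'post_status' => $blogpoststatus,
	  'post_title' => $posttitle,
	  'post_type' => 'post',
	  'tags_input' => $_POST['post_tags']
	);
	// Returns the new post ID, or 0 on failure.
	$wpid = wp_insert_post($post);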


I did a government site a while back with similar restrictions. After downloading the content to a directory using an offline viewer, I ran regular expressions over the content until I had 50,000 usable documents. Then I ran a PHP script to pull in 100 pages at a time, post them to the WP database, and move the processed files.
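
The batch approach described above might look something like this. It is only a sketch, not Christopher's script: import_batch, the directory names, and the $import callback are all hypothetical, and the callback here is a stand-in so the example runs without WordPress. In a real import the callback would build the post array and call wp_insert_post().

```php
<?php
// Hypothetical helper: processes up to $batch_size .html files from $inbox,
// hands each file's name and contents to $import, then moves the file to
// $done so the next run skips it. Returns the number of files imported.
function import_batch(string $inbox, string $done, callable $import, int $batch_size = 100): int
{
    $files = glob(rtrim($inbox, '/') . '/*.html') ?: array();
    sort($files);
    $count = 0;
    foreach (array_slice($files, 0, $batch_size) as $file) {
        $html = file_get_contents($file);
        if ($html === false || !$import(basename($file), $html)) {
            continue; // leave the file in place so it can be retried
        }
        rename($file, rtrim($done, '/') . '/' . basename($file));
        $count++;
    }
    return $count;
}

// Demo with temporary directories and a stand-in importer.
$inbox = sys_get_temp_dir() . '/import_inbox_' . uniqid();
$done  = sys_get_temp_dir() . '/import_done_' . uniqid();
mkdir($inbox);
mkdir($done);
for ($i = 1; $i <= 5; $i++) {
    file_put_contents("$inbox/page$i.html", "<h1>Page $i</h1>");
}

$imported = array();
$importer = function (string $name, string $html) use (&$imported): bool {
    $imported[] = $name; // a real importer would call wp_insert_post() here
    return true;
};

$first  = import_batch($inbox, $done, $importer, 3);
$second = import_batch($inbox, $done, $importer, 3);
echo "$first then $second files imported\n";
```

Moving each file only after a successful insert means the script can be stopped and rerun at any point without importing the same page twice.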

______________________


Keith P. Graham kpgraham at gmail.com 
Wed Feb 23 15:00:31 UTC 2011

I wrote a simple plugin called Instant-Content-plugin
http://wordpress.org/extend/plugins/instant-content-plugin to import
static files into posts. It uses text files, but you can play with the
code to get that the way you want, commenting out the code that
replaces crlf with br tags. I made it so it can import data from a zip
file of static pages, using the first line of each file as a title.

I've used this to import large amounts of static data into WP. Most
recently I have been using it to import pages from my older sites into
WP. I had several sites with hundreds of pages of content that just
grew over the years. Each page had a slightly different layout. I
loaded all the pages into Notepad++ and made global changes until I
got just bare content. I then zipped up the files and loaded them into
WP pages using the instant-content plugin.

Keith
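
The "first line is the title" convention described above can be illustrated with a few lines of PHP. This is not the plugin's actual code, just a sketch of the idea: split_title_and_body and its $newlines_to_br switch are hypothetical names, with the switch standing in for the line-break replacement that Keith suggests commenting out for HTML input.

```php
<?php
// Hypothetical helper: splits a plain-text file into a title (first line)
// and a body (everything after it).
function split_title_and_body(string $text, bool $newlines_to_br = true): array
{
    // Normalise Windows and old Mac line endings to "\n".
    $text  = str_replace(array("\r\n", "\r"), "\n", $text);
    $parts = explode("\n", ltrim($text, "\n"), 2);
    $title = trim($parts[0]);
    $body  = isset($parts[1]) ? trim($parts[1]) : '';
    if ($newlines_to_br) {
        // Suitable for plain text; pass false when the body is already HTML.
        $body = str_replace("\n", "<br />\n", $body);
    }
    return array('post_title' => $title, 'post_content' => $body);
}

$post = split_title_and_body("My First Page\r\nLine one.\r\nLine two.\r\n");
echo $post['post_title'], "\n", $post['post_content'], "\n";
```

The returned keys match the names wp_insert_post() uses, so the array could be extended with a status and post type and passed along.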

______________________



Paul paul at codehooligans.com 
Tue Feb 22 22:49:58 UTC 2011

Scott. 

I've used http://sourceforge.net/projects/simplehtmldom/ a number of times. 

I'll send you a working script I use to suck down a site. 

P-







