[wp-hackers] pulling a massive HTML site into Wordpress

Baki Goxhaj banago at gmail.com
Mon Jun 6 10:23:38 UTC 2011


Well, if you don't know some PHP I don't know how you are going to do it,
but here is my advice.

1. Use http://simplehtmldom.sourceforge.net/ to pull content from the
old site and map it to the matching WordPress fields
2. Use a custom script to insert posts - like the one you quoted above that
makes use of the wp_insert_post() function
3. Import the content as posts rather than pages, as that many pages don't
scale and will kill your site
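
A rough sketch of steps 1 and 2, not a finished importer. To keep it
runnable without extra downloads it uses PHP's built-in DOMDocument in
place of Simple HTML DOM. The function name extract_fields(), the
<div id="content"> selector and the /path/to/old-site folder are all
assumptions for illustration; adapt them to the markup of the old pages.

```php
<?php
// Sketch only: extract_fields() and the XPath selectors are hypothetical
// and must be adapted to the markup of the old site.

/**
 * Pull a title and body out of one legacy HTML page.
 * Assumes the title is in <title> and the content in <div id="content">.
 */
function extract_fields($html) {
    $fields = array('post_title' => '', 'post_content' => '');
    if (trim($html) === '') {
        return $fields; // nothing to parse
    }

    $doc = new DOMDocument();
    // Old pages are rarely valid HTML, so collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath     = new DOMXPath($doc);
    $titleNode = $xpath->query('//title')->item(0);
    $bodyNode  = $xpath->query('//div[@id="content"]')->item(0);

    if ($titleNode) {
        $fields['post_title'] = trim($titleNode->textContent);
    }
    if ($bodyNode) {
        // Keep the inner markup of the content div.
        $body = '';
        foreach ($bodyNode->childNodes as $child) {
            $body .= $doc->saveHTML($child);
        }
        $fields['post_content'] = trim($body);
    }
    return $fields;
}

// Only insert when run inside WordPress (for example after loading wp-load.php).
if (function_exists('wp_insert_post')) {
    $files = glob('/path/to/old-site/*.html') ?: array(); // hypothetical path
    foreach ($files as $file) {
        $fields = extract_fields(file_get_contents($file));
        $fields['post_type']   = 'post';  // posts, not pages (step 3)
        $fields['post_status'] = 'draft'; // review before publishing
        wp_insert_post($fields);
    }
}
```

Inserting as drafts first lets you check a small batch on the MAMP
install before running the whole archive.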

Good luck.

Kindly,

Baki Goxhaj
www.wplancer.com | proverbhunter.com | www.banago.info


On Mon, Jun 6, 2011 at 11:15 AM, John Black <immanence7 at gmail.com> wrote:

> Hi there,
>
> I have a gargantuan HTML based site I want to port to Wordpress. I'm
> talking 52,000 individual HTML pages, with a further 10,000 pages with
> minimal content (mostly pictures with captions) that are basically child
> pages of these main pages.
>
> In addition to the 60,000 HTML pages in total, the site has around 52,000
> images embedded across these pages.
>
> Happily, there is some regularity across these pages, and it appears (from
> a quick look) that some elements (e.g., names of the authors, and extracts)
> could be ripped out along with the main content. Not all the archive is
> regularised, however. Site content goes back to 1998.
>
> First, I know I'm going to need a solid strategy here. I kind of need help
> with that, for although I face this massive task, I'm actually a graphic
> designer, not a programmer.
>
> I know I will need to design a Wordpress page or post structure to
> ultimately contain the various parts of these pages I want to rip.
>
> But most of all, I need to know how to start.
>
> 1. I looked at the discussion had here on wp-hackers in February on
> "Porting static content". It gives a few leads, but I have questions.
>
> 2. I looked at the plugin Import HTML Pages. I need to look closer, but a
> first run attempt on a limited number of files failed. I'm working on a
> localhost MAMP set up.
>
> 3. The consensus from the earlier February discussion was that PHP Simple
> HTML DOM Parser (http://sourceforge.net/projects/simplehtmldom/) was a
> great tool. I looked at it, but have no idea how to even start using this. I
> don't find documentation that speaks to someone as code illiterate as me.
> Can it be used on a localhost MAMP installation?
>
> 4. Someone else mentioned PhpQuery (http://code.google.com/p/phpquery/). I
> haven't the faintest idea how to use that either.
>
> Basically, I need some kind of heads up here.
>
> Can someone give a kind of overview of what is possible? Does anyone have
> any scripts they could share with me that might streamline this and save me
> from an impossible learning curve?
>
> I include below some excerpts of the discussion in February. If anyone can
> elaborate further on some of that, I'd love to read more!
>
>
> Thanks so much for reading, and sorry for not understanding code. My head
> works in another way.
>
> best,
> John Black
>
> ______________________
>
>
> Bill Dennen dennen at gmail.com
> Wed Feb 23 01:29:35 UTC 2011
>
> >> I've used http://sourceforge.net/projects/simplehtmldom/ a number of
> times.
>
> I've had good luck with that, and phpQuery.
>
> http://code.google.com/p/phpquery/
> phpQuery is a server-side, chainable, CSS3 selector driven Document
> Object Model (DOM) API based on jQuery JavaScript Library
>
> In our case, we first build a sitemap of the pages we wanted to
> import. It's basically a hierarchy built as a multi-level, unordered
> list. This allows us to maintain parent-child relationships between
> pages when they are imported into WordPress. We loop over that
> unordered list of links, scrape each page, and use phpQuery to select
> different parts of the page based on jQuery selectors. We can also add
> custom fields to these imported pages using add_post_meta.
>
> -Bill
>
> ______________________
>
>
> Christopher Ross cross at thisismyurl.com
> Tue Feb 22 22:55:22 UTC 2011
>
> Scot, this sounds like a perfect use of the wp_insert_post() function
>
>
>        $post = array(
>          'comment_status' => 'closed',
>          'ping_status' => 'closed',
>          'post_author' => $authorID,
>          'post_category' => $cat,
>          'post_content' => mysql_real_escape_string($_POST['post_content']),
>          'post_excerpt' => mysql_real_escape_string($_POST['post_excerpt']),
>          'post_status' => $blogpoststatus,
>          'post_title' => $posttitle,
>          'post_type' => 'post',
>          'tags_input' =>  $_POST['post_tags']
>        );
>        $wpid = wp_insert_post($post);
>
>
> I did a government site a while back with similar restrictions. After
> downloading the content to a directory using an offline viewer, I ran
> RegEx on the content until I had 50,000 usable documents. After that, I
> ran a PHP script to pull in 100 pages at a time, post them to the WP
> database and move those files.
>
> ______________________
>
>
> Keith P. Graham kpgraham at gmail.com
> Wed Feb 23 15:00:31 UTC 2011
>
> I wrote a simple plugin called Instant-Content-plugin
> http://wordpress.org/extend/plugins/instant-content-plugin to import
> static files into posts. It uses text files, but you can play with the
> code to get that the way you want, commenting out the code that
> replaces crlf with br tags. I made it so it can import data from a zip
> file of static pages, using the first line of each file as a title.
>
> I've used this to import large amounts of static data into WP. Most
> recently I have been using it to import pages from my older sites into
> WP. I had several sites with hundreds of pages of content that just
> grew over the years. Each page had a slightly different layout. I
> loaded all the pages into Notepad++ and made global changes until I
> got just bare content. I then zipped up the files and loaded them into
> WP pages using the instant-content plugin.
>
> Keith
>
> ______________________
>
>
>
> Paul paul at codehooligans.com
> Tue Feb 22 22:49:58 UTC 2011
>
> Scott.
>
> I've used http://sourceforge.net/projects/simplehtmldom/ a number of
> times.
>
> I'll send you a working script I use to suck down a site.
>
> P-
>
>
>
>
>
>
> _______________________________________________
> wp-hackers mailing list
> wp-hackers at lists.automattic.com
> http://lists.automattic.com/mailman/listinfo/wp-hackers
>

