[wp-hackers] Measuring "distance" between two web pages

David Anderson david at wordshell.net
Tue Sep 18 11:35:06 UTC 2012


This is a more general question than specifically WordPress, but perhaps 
someone will have an idea.

With WordShell (wordshell.net) we've got customers who manage enormous 
numbers of sites (mass hosting). They either can't be bothered or can't 
afford to test each site individually after an update (say, from 
WordPress 3.4.1 to 3.4.2).

WordShell presently visits the site home-page before and after 
alterations, to test for HTTP errors. But that's rather basic.

What I'd like to know is whether anyone knows of a tool or general method 
for producing a score for the difference between two HTML pages. That is, 
I download the HTML before, and again after - are they "probably" the 
same page? They can't be expected to be exactly the same; for one thing, 
the "Generator" meta tag will have changed. The page might also show the 
present time, or have hidden timings in the HTML source, etc.
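
To illustrate the kind of noise that would have to be ignored, here is a 
rough Python sketch that strips the "Generator" meta tag and HTML comments 
(where hidden timings often end up) before any comparison is made. The 
function name and the regular expressions are my own assumptions, not 
anything WordShell currently does, and regex-based stripping is only an 
approximation of real HTML parsing.

```python
import re

# Hypothetical patterns for content that is expected to differ between fetches.
GENERATOR_RE = re.compile(r'<meta[^>]*name=["\']generator["\'][^>]*>', re.IGNORECASE)
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)
WHITESPACE_RE = re.compile(r'\s+')


def normalise(html: str) -> str:
    """Remove the generator meta tag and comments, then collapse whitespace."""
    html = GENERATOR_RE.sub('', html)
    html = COMMENT_RE.sub('', html)  # note: also drops IE conditional comments
    return WHITESPACE_RE.sub(' ', html).strip()
```

Visible timestamps in the page body would still get through this; they 
would need site-specific handling or a comparison that tolerates small 
text changes.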

So, what I'm interested in is a way to produce a "distance" score 
between the two pages, so that we can say something like "if the 
difference is more than X, then there is a Y% chance that these are not 
the same page". By choosing a value for Y, I can then work out the 
threshold X to test against. WordShell would then put up a flag based on 
that statistical likelihood - "OY! There's a good chance that that update 
broke the page!".
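
As a rough sketch of what such a score might look like, the Python below 
compares the sequence of opening tags in each page using the standard 
library's difflib.SequenceMatcher, and returns a value between 0 
(identical structure) and 1 (nothing in common). The names page_distance, 
probably_broken and THRESHOLD_X are hypothetical, and the threshold value 
is an arbitrary placeholder: mapping X to a real Y% likelihood would need 
calibrating against pages known to be fine and pages known to be broken, 
which this does not attempt.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def page_distance(before, after):
    """Return 0.0 for identical tag structure, up to 1.0 for no overlap."""
    matcher = SequenceMatcher(None, tag_sequence(before), tag_sequence(after),
                              autojunk=False)
    return 1.0 - matcher.ratio()


THRESHOLD_X = 0.3  # placeholder; a real value would come from calibration


def probably_broken(before, after, threshold=THRESHOLD_X):
    """Flag the page when the structural distance exceeds the threshold."""
    return page_distance(before, after) > threshold
```

Comparing only tag structure ignores changes in text such as the current 
time, but it would also miss a page whose layout is intact and whose 
content is wrong, so it is one possible distance among several.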

This would be useful not just for maintaining WordPress but for any 
website, yet I haven't come across anything like it. Any ideas?

Many thanks,

WordShell - WordPress fast from the CLI - www.wordshell.net
