[wp-trac] [WordPress Trac] #16893: Stop or reduce crawling of comment reply ?replytocom URLs

WordPress Trac wp-trac at lists.automattic.com
Sat Mar 19 06:37:02 UTC 2011


#16893: Stop or reduce crawling of comment reply ?replytocom URLs
-------------------------+------------------------------------
 Reporter:  joelhardi    |      Owner:
     Type:  enhancement  |     Status:  new
 Priority:  normal       |  Milestone:  Awaiting Review
Component:  Comments     |    Version:  3.1
 Severity:  normal       |   Keywords:  has-patch dev-feedback
-------------------------+------------------------------------
 (For full background, you can check out the off-topic comments at the ends
 of #10550 and #16881.)

 r16230 (quite appropriately) removed the `rel="nofollow"` attribute from
 the <a href="?replytocom">Reply to this comment</a> links in the comments
 display. Since then, users have reported that search engines are crawling
 these pages (as one would expect). This means unnecessary server overhead
 (these pages are almost always dynamically generated even when using a
 caching plugin) and may reduce how often search engines crawl "real"
 pages, since there are so many of these dummy URLs to index.

 Additionally, there may be SEO-related reasons why this is bad. Although
 that may be largely mitigated by `rel="canonical"` and the fact that
 contents of these pages are 99.9% a duplicate of their canonical versions,
 search engines are also known to penalize sites for having many pages with
 duplicate content. Also, the specification for the canonical link relation
 states that its use in page-ranking calculations is at the discretion of
 indexers (i.e., using canonical is no deterministic guarantee of
 anything).

 I'm attaching 2 patches to be considered/discussed separately:

 == 1: robots meta tag ==

 The first applies the [http://www.robotstxt.org/ robots exclusion
 standard] to these URLs.

 For individual site admins, putting `Disallow: *?replytocom` in your
 robots.txt is the obvious fix. #11918 would have added this rule to what
 do_robots() returns, but it was dropped (well before r16230). I would
 support putting it back in.
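
 To make the effect of that rule concrete, here is a rough Python sketch
 (not WordPress code) of how a crawler that supports wildcards might test
 it against a URL. It assumes Google-style matching, where `*` matches any
 run of characters, a trailing `$` anchors the end, and the pattern is
 otherwise a prefix match; the original robots exclusion standard does not
 define wildcards, and `rule_matches` is a hypothetical helper name.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path+query.

    Assumption: Google-style semantics ('*' wildcard, optional trailing '$',
    prefix match otherwise). Other crawlers may behave differently.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece and let '*' stand for "any characters".
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, which gives the prefix-match behavior.
    return re.match(regex, path) is not None


if __name__ == "__main__":
    print(rule_matches("*?replytocom", "/2011/03/hello-world/?replytocom=42"))
    print(rule_matches("*?replytocom", "/2011/03/hello-world/"))
```

 Under these assumptions the rule would not match a URL such as
 `/?p=1&replytocom=42`, where the parameter follows `&` rather than `?`, so
 sites without pretty permalinks might need a broader pattern.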

 My patch general-template.php.17522.diff would add the
 `<meta name='robots' content='noindex,nofollow' />` tag (already used when
 the !WordPress privacy setting is enabled) to the ?replytocom URLs. The
 pages would still be hit by crawlers but would no longer be indexed, which
 fully addresses the duplicate indexing issue and renders moot any debate
 over the effectiveness of "canonical."
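
 The decision the patch makes can be sketched roughly as follows. This is
 a standalone Python illustration, not the patch itself: `robots_meta_for`
 and `ROBOTS_META` are hypothetical names, and the only assumption is that
 the tag is emitted whenever a `replytocom` query parameter is present.

```python
from urllib.parse import parse_qs, urlsplit

# The tag quoted in the ticket, as already used for the privacy setting.
ROBOTS_META = "<meta name='robots' content='noindex,nofollow' />"


def robots_meta_for(url: str) -> str:
    """Return the robots meta tag for ?replytocom URLs, else an empty string."""
    # keep_blank_values=True so that an empty "replytocom=" still counts.
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return ROBOTS_META if "replytocom" in query else ""


if __name__ == "__main__":
    print(robots_meta_for("http://example.com/post/?replytocom=42#respond"))
    print(repr(robots_meta_for("http://example.com/post/")))
```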

 Compared to robots.txt, this is an imperfect solution to the
 crawling/server overhead problem because these pages would still have to
 be at least partially retrieved to read the meta tag. I expect that it
 will still have a beneficial effect in this area, because well-behaved
 crawlers should reduce how often they retrieve these noindex URLs.

 I've got this code deployed on some live sites to test this idea out (but
 they previously had a robots.txt rule blocking ?replytocom, so I don't
 expect instantly useful info).

 Even if it's not 100% effective, I think this is an improvement and a
 trivial enhancement that could go into core soon. (This is the
 dev-feedback item.)

 == 2: change links to forms ==

 Picking up on an idea filosofo had to replace the <a> links with forms,
 I've done that with comment-template.php.17522.diff.

 Functionally, this is a drop-in replacement, since these are GET forms
 that produce the same HTTP request as the current <a> tags (please, try
 them out!).
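
 The equivalence can be checked with a small Python sketch. It is only an
 illustration: it assumes the form carries `replytocom` as a single named
 field and that the form's action is the post permalink without an
 existing query string; `link_url` and `get_form_url` are hypothetical
 helpers, and the markup in the actual patch may differ.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def link_url(permalink: str, comment_id: int) -> str:
    """URL requested when following the existing <a> reply link."""
    return f"{permalink}?replytocom={comment_id}"


def get_form_url(action: str, fields: dict) -> str:
    """URL requested when a GET form is submitted.

    For GET forms the encoded fields become the query string of the
    action URL.
    """
    parts = urlsplit(action)
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(fields), parts.fragment)
    )


if __name__ == "__main__":
    permalink = "http://example.com/2011/03/hello-world/"
    print(link_url(permalink, 42))
    print(get_form_url(permalink, {"replytocom": 42}))
```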

 Although Google has been
 [http://googlewebmastercentral.blogspot.com/2008/04/crawling-through-html-forms.html crawling forms]
 for some time, it's only been on an experimental basis and isn't
 widespread. It also does not affect crawlers' page selection or the search
 engine ranking in any way. So, changing these links to GET forms should
 sharply reduce (but possibly not eliminate) crawling of these URLs. I
 don't have any data for this, so it merits further discussion.

 One implementation issue would be that since the <a> tags are now <button>
 elements, it's going to affect themes. That alone could be a deal-killer!
 I chose the button element over input because it's easier to style in a
 cross-browser way (i.e. to look just like a link), but defer to the UI
 folks on that.

 So ... this 2nd idea isn't necessarily fully cooked and would need broader
 support.

 And, someone may have an even better way to address the overall issue.

-- 
Ticket URL: <http://core.trac.wordpress.org/ticket/16893>
WordPress Trac <http://core.trac.wordpress.org/>
WordPress blogging software