[wp-trac] [WordPress Trac] #16893: Stop or reduce crawling of comment reply ?replytocom URLs

Sat Mar 19 19:18:43 UTC 2011

#16893: Stop or reduce crawling of comment reply ?replytocom URLs
------------------------------------+------------------------------
 Reporter:  joelhardi               |       Owner:
     Type:  enhancement             |      Status:  new
 Priority:  normal                  |   Milestone:  Awaiting Review
Component:  Comments                |     Version:  3.1
 Severity:  normal                  |  Resolution:
 Keywords:  has-patch dev-feedback  |
------------------------------------+------------------------------

Comment (by joelhardi):

 Replying to [comment:1 hakre]:
 > Why nofollow?

 I'm assuming you mean in the robots meta tag?

 Well, a distinction should be made here between this and rel="nofollow" --
 the robots version predates the rel attribute and has a different meaning.
 (The potential for confusion has been noted in the
 [http://microformats.org/wiki/rel-nofollow#open_issues rel=nofollow spec]
 since forever.) The [http://www.robotstxt.org/meta.html robots meta
 version] actually means "do not scan this page for links to follow"
 whereas the rel attribute means "leave this link out of your link-value
 calculations" (i.e. PageRank). You probably knew that; I just thought I'd
 explain for whoever comes along since these tickets have veered way OT.

 So, in the robots version, we could do just "noindex" but leave out
 "nofollow" and the page would not be indexed, but the search engine
 crawler would still scan it for links to other pages to index.
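
 To make the two directives concrete, here is a minimal sketch of how a
 crawler might read the content value of a robots meta tag. The function
 name parse_robots_meta is hypothetical, and real crawlers recognize more
 directives than the two discussed here; this only models "noindex" and
 "nofollow" as described above.

```python
from typing import Tuple


def parse_robots_meta(content: str) -> Tuple[bool, bool]:
    """Return (may_index, may_follow) for a robots meta content value.

    Hypothetical helper: only "noindex" and "nofollow" are modelled.
    Anything else is treated as the default (index and follow allowed).
    """
    # Directives are comma-separated and case-insensitive.
    directives = {token.strip().lower() for token in content.split(",")}
    may_index = "noindex" not in directives
    may_follow = "nofollow" not in directives
    return may_index, may_follow


# "noindex" alone: keep the page out of the index, but still scan its links.
print(parse_robots_meta("noindex"))           # (False, True)
# "noindex,nofollow": neither index the page nor scan it for links.
print(parse_robots_meta("noindex,nofollow"))  # (False, False)
```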

 Anyway, I thought about whether to include it, because right now there
 are no "bad" links on the ?replytocom pages. All the links are identical
 to those on the regular post/comment page, with the exception of cancel-
 comment-reply-link, and it only hrefs "${postURL}#respond".

 But, for the same reason there's also no reason to '''not''' include it --
 the links that would be followed have already been spidered one hop before
 getting to the ?replytocom page.

 I decided to include it because the goal is to eliminate or reduce
 crawling of these pages -- so the hope is that if they're
 "noindex,nofollow", smart search engines like Googlebot will adjust their
 crawl frequency of these URLs down, since they're of zero value. Possibly
 they could also cancel page downloading midstream when the HTTP response
 is larger than the TCP window size, reducing transfer (although this is so
 minor I don't think it's worth a big investigation).

 Whereas, if the pages aren't nofollow, bots can/should still scan them for
 links, so they're going to be crawled perhaps just as frequently as before
 (only, not be indexed).
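
 The rule being argued for can be sketched as follows. This is only an
 illustration of the logic in Python, not the attached patch: the function
 name robots_meta_for and the example URLs are assumptions made up for
 this sketch.

```python
from urllib.parse import parse_qs, urlsplit


def robots_meta_for(url: str) -> str:
    """Return a robots meta tag for ?replytocom URLs, else an empty string.

    Hypothetical helper illustrating the rule discussed in this comment.
    """
    # keep_blank_values=True so that "?replytocom=" (no value) still matches.
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if "replytocom" in query:
        return '<meta name="robots" content="noindex,nofollow" />'
    return ""


# A comment-reply URL gets the tag; the regular post URL gets nothing.
print(robots_meta_for("http://example.com/post/?replytocom=42#respond"))
print(repr(robots_meta_for("http://example.com/post/")))
```

 Dropping "nofollow" from the content string in this sketch would give the
 "noindex"-only variant discussed above.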

 Hypothetically, if in the future !WordPress were to add some new, unique
 link to these ?replytocom pages (unlikely, I know), we would have to
 revisit this issue. However, I think such a link is more likely to be
 another functional, app-controller style URL (like "cancel comment reply")
 that we don't want crawled than something with unique high-value content
 that we do want followed. So, that also argues in favor of including
 "nofollow".

-- 
Ticket URL: <http://core.trac.wordpress.org/ticket/16893#comment:2>
WordPress Trac <http://core.trac.wordpress.org/>
WordPress blogging software