[wp-trac] [WordPress Trac] #16893: Stop or reduce crawling of comment reply ?replytocom URLs
WordPress Trac
wp-trac at lists.automattic.com
Sat Mar 19 19:18:43 UTC 2011
#16893: Stop or reduce crawling of comment reply ?replytocom URLs
------------------------------------+------------------------------
Reporter: joelhardi | Owner:
Type: enhancement | Status: new
Priority: normal | Milestone: Awaiting Review
Component: Comments | Version: 3.1
Severity: normal | Resolution:
Keywords: has-patch dev-feedback |
------------------------------------+------------------------------
Comment (by joelhardi):
Replying to [comment:1 hakre]:
> Why nofollow?
I'm assuming you mean in the robots meta tag?
Well, a distinction should be made here between this and rel="nofollow" --
the robots version predates the rel value and has a different meaning.
(The potential for confusion has been noted in the
[http://microformats.org/wiki/rel-nofollow#open_issues rel=nofollow spec]
since forever.) The [http://www.robotstxt.org/meta.html robots meta
version] actually means "do not scan this page for links to follow"
whereas the rel attribute means "leave this link out of your link-value
calculations" (i.e. PageRank). You probably knew that, I just thought I'd
explain for whoever comes along since these tickets have veered way OT.
So, in the robots version, we could do just "noindex" but leave out
"nofollow" and the page would not be indexed, but the search engine
crawler would still scan it for links to other pages to index.
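To illustrate how the two robots meta directives act independently, here is a rough sketch in Python of how a crawler might read the content value. This is not taken from the patch or from any real crawler; the function name parse_robots_meta is made up for the example.

```python
def parse_robots_meta(content):
    """Return (may_index, may_follow) for a robots meta content value.

    Hypothetical helper for illustration only.
    """
    # Directives are comma-separated and case-insensitive.
    directives = {d.strip().lower() for d in content.split(",") if d.strip()}
    if "none" in directives:
        # "none" is shorthand for "noindex,nofollow".
        return (False, False)
    may_index = "noindex" not in directives
    may_follow = "nofollow" not in directives
    return (may_index, may_follow)
```

With just "noindex" the page is kept out of the index but its links may still be scanned; adding "nofollow" turns off the link scan as well.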
Anyway, I thought about whether to include it, because right now there
are no "bad" links on the ?replytocom pages. All the links are identical
to those on the regular post/comment page, with the exception of
cancel-comment-reply-link, whose href is just "${postURL}#respond".
But, for the same reason there's also no reason to '''not''' include it --
the links that would be followed have already been spidered one hop before
getting to the ?replytocom page.
I decided to include it because the goal is to eliminate or reduce
crawling of these pages -- so the hope is that if they're
"noindex,nofollow", smart crawlers like Googlebot will adjust their
crawl frequency of these URLs down, since they're of zero value. Possibly
they could also cancel page downloading midstream when the HTTP response
is larger than the TCP window size, reducing transfer (although this is so
minor I don't think it's worth a big investigation).
Whereas, if the pages aren't nofollow, bots can/should still scan them for
links, so they're going to be crawled perhaps just as frequently as before
(they just won't be indexed).
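To make the idea concrete, here is a minimal sketch in Python of emitting the tag only for ?replytocom URLs. It is illustrative only: the real change would live in WordPress's PHP, and the function name robots_meta_for and the example.com URLs are assumptions for the example, not part of the patch.

```python
from urllib.parse import urlsplit, parse_qs


def robots_meta_for(url):
    """Return a robots meta tag for comment-reply URLs, else an empty string.

    Hypothetical helper for illustration only.
    """
    # Parse the query string; the #respond fragment is split off by urlsplit.
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if "replytocom" in query:
        return '<meta name="robots" content="noindex,nofollow" />'
    return ""


print(robots_meta_for("http://example.com/a-post/?replytocom=5#respond"))
print(repr(robots_meta_for("http://example.com/a-post/")))
```

The regular post page gets no tag, so it continues to be indexed and followed as before.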
Hypothetically, if in the future !WordPress were to add some new, unique
link to these ?replytocom pages (unlikely, I know), we might have to
revisit this issue. However, I think such a link is more likely to be
another functional, app-controller style URL (like "cancel comment reply")
that we don't want crawled than something with unique high-value content
that we do want followed. So, that also argues in favor of including
"nofollow".
--
Ticket URL: <http://core.trac.wordpress.org/ticket/16893#comment:2>
WordPress Trac <http://core.trac.wordpress.org/>
WordPress blogging software