[wp-trac] [WordPress Trac] #10550: nofollow attribute added to comment_reply_link function
WordPress Trac
wp-trac at lists.automattic.com
Fri Mar 18 10:06:10 UTC 2011
#10550: nofollow attribute added to comment_reply_link function
--------------------------+-----------------------
Reporter: seo-dave | Owner:
Type: defect (bug) | Status: reopened
Priority: normal | Milestone: 3.1
Component: Comments | Version:
Severity: minor | Resolution:
Keywords: needs-patch |
--------------------------+-----------------------
Comment (by joelhardi):
See my [http://core.trac.wordpress.org/ticket/16881#comment:9 comment 9]
on #16881 ... I don't think you've got the distinctions between the robots
exclusion standard (robots.txt and meta tag), canonical and nofollow
quite right.
* in the nofollow spec, it's explicitly stated that it's not meant to
stop search engines from indexing, so site owners shouldn't use it for
that. Search engines can crawl these URLs, the nofollow is just meant to
be taken as advice to search engines on link value. Matt Cutts and others
have also said not to misuse it for internal links, which brings us back
to why seo-dave opened this ticket to start with.
* canonical affects if and how search engines combine the rankings of
specific URLs, but it is up to their discretion how they use it. (Because
if you think for 5 seconds like an "evil SEO" you can imagine all sorts of
nefarious ways to misuse it if search engines took it literally.)
* the [http://www.robotstxt.org robots exclusion standard] is for
stopping crawling. Note that "nofollow" as used by this standard !=
[http://microformats.org/wiki/rel-nofollow#open_issues rel=nofollow]
I feel that this bug was correctly closed, because !WordPress was misusing
the nofollow attribute for internal links, which had bad consequences, and
r16230 corrected it. I filed #16881 with a trivial patch to hit a few
places that r16230 missed.
So now the problem is a bad side effect: people's ?replytocom pages are
(a) being crawled more and (b) being indexed more.
Is (a) a problem? Well, crawlers like Googlebot generally are smart enough
not to overload a site, so maybe not a big one. But you would still want
to advise spiders not to crawl these URLs, since crawling them wastes
everyone's cycles.
Is (b) a problem? Yes, because these are non-canonical pages and we don't
want them indexed or linked to externally (and becoming de-facto
canonical). The canonical link tag may mostly reduce this problem but not
totally solve it.
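To make "non-canonical" concrete, here is a minimal sketch in Python (not !WordPress's own PHP) of how a ?replytocom URL maps back to its canonical form. The function name canonical_url and the example.com URLs are made up for illustration; this is not what core does, just the idea behind the canonical link tag.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def canonical_url(url, drop=("replytocom",)):
    """Return url without the listed query parameters and without a fragment.

    Hypothetical helper: it only illustrates that every ?replytocom=N
    variant of a post collapses to the same canonical URL.
    """
    parts = urlsplit(url)
    # Keep every query parameter except the ones we treat as non-canonical.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in drop
    ]
    # Rebuild the URL; the fragment (e.g. #respond) is dropped as well.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

Every reply link on a post would then point search engines at the same URL, which is the one we actually want indexed.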
The answer to both (a) and (b) is to apply the robots exclusion standard.
By stopping crawling, you stop indexing.
OK, so robots.txt is the obvious way to solve this: add the one line I
posted earlier and you've fixed it for your site.
But, what about !WordPress, for people who don't know about robots.txt?
Some possible options:
* manipulate robots.txt in the filesystem (like .htaccess). Or, when
there is no robots.txt and URLs rewrite to index.php, !WordPress could
return a robots.txt.
* add the robots meta tag with "noindex,nofollow" to the ?replytocom
pages. See my 2nd patch to #16881. This is an easy win, but these URLs
will still be hit (but not indexed) since the page has to be at least
partly downloaded for the tag to be read. Possibly not hit as often
though, once Googlebot takes into account that they have "noindex" set.
* I can't think of any third option I like: JavaScript-only links,
user-agent detection to hide things from bots, or using POST to spoof
links (which makes me shudder). 307 redirects, cookies, #! URLs. All
incomplete and complicated. Maybe someone else can come up with something?
OK, end of brain dump. I think the robots meta tag (see #16881) is such an
easy win that it makes sense. It solves the indexing problem, and once
these URLs don't show up anymore when people log into Google Webmaster
Tools or whatever, I think the visibility of this issue goes way down.
People assume that Googlebot hitting these URLs is bad, but if it's
happening infrequently and at 3 a.m., does it really matter?
--
Ticket URL: <http://core.trac.wordpress.org/ticket/10550#comment:15>
WordPress Trac <http://core.trac.wordpress.org/>
WordPress blogging software