[wp-trac] [WordPress Trac] #10550: nofollow attribute added to comment_reply_link function

WordPress Trac wp-trac at lists.automattic.com
Fri Mar 18 10:06:10 UTC 2011


#10550: nofollow attribute added to comment_reply_link function
--------------------------+-----------------------
 Reporter:  seo-dave      |       Owner:
     Type:  defect (bug)  |      Status:  reopened
 Priority:  normal        |   Milestone:  3.1
Component:  Comments      |     Version:
 Severity:  minor         |  Resolution:
 Keywords:  needs-patch   |
--------------------------+-----------------------

Comment (by joelhardi):

 See my [http://core.trac.wordpress.org/ticket/16881#comment:9 comment 9]
 on #16881 ... I don't think you've got the distinctions between the robots
 exclusion standard (robots.txt and meta tag), canonical and nofollow
 quite right.

  * in the nofollow spec, it's explicitly stated that it's not meant to
 stop search engines from indexing, so site owners shouldn't use it for
 that. Search engines can crawl these URLs, the nofollow is just meant to
 be taken as advice to search engines on link value. Matt Cutts and others
 have also said not to misuse it for internal links, which brings us back
 to why seo-dave opened this ticket to start with.

  * canonical affects if and how search engines combine the rankings of
 specific URLs, but it is up to their discretion how they use it. (Because
 if you think for 5 seconds like an "evil SEO" you can imagine all sorts of
 nefarious ways to misuse it if search engines took it literally.)

  * the [http://www.robotstxt.org robots exclusion standard] is for
 stopping crawling. Note that "nofollow" as used by this standard !=
 [http://microformats.org/wiki/rel-nofollow#open_issues rel=nofollow]

 I feel that this bug was correctly closed, because !WordPress was misusing
 the nofollow attribute for internal links, which had bad consequences, and
 r16230 corrected it. I filed #16881 with a trivial patch to hit a few
 places that r16230 missed.

 So, now the problem is that we have the bad side effects that people's
 ?replytocom pages are (a) being crawled more and (b) being indexed more.

 Is (a) a problem? Well, crawlers like Googlebot generally are smart enough
 not to overload a site, so maybe not a big one. But it definitely is a
 case where you would want to advise spiders not to crawl these URLs,
 since crawling them wastes everyone's cycles.

 Is (b) a problem? Yes, because these are non-canonical pages and we don't
 want them indexed or linked to externally (and becoming de-facto
 canonical). The canonical link tag may mostly reduce this problem but not
 totally solve it.

 The answer to both (a) and (b) is to apply the robots exclusion standard.
 By stopping crawling, you stop indexing.

 OK, so robots.txt is the obvious way to solve this: add the one line I
 posted earlier and you've fixed it for your site.

 But, what about !WordPress, for people who don't know about robots.txt?
 Some possible options:
  * manipulate robots.txt in the filesystem (like .htaccess). Or, when
 there is no robots.txt and URLs rewrite to index.php, !WordPress could
 return a robots.txt.

  * add the robots meta tag with "noindex,nofollow" to the ?replytocom
 pages. See my 2nd patch to #16881. This is an easy win, but these URLs
 will still be hit (but not indexed) since the page has to be at least
 partly downloaded for the tag to be read. Possibly not hit as often
 though, once Googlebot takes into account that they have "noindex" set.

  * I can't think of any third option I like. Javascript-only links,
 user-agent detection to hide the bad links from bots, using POST to spoof
 links (which makes me shudder), a 307 redirect, a cookie, #! URLs: all
 are incomplete and complicated. Maybe someone else can come up with
 something?
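
 To make the robots meta tag option concrete, here is a rough sketch of the
 decision involved. It is written in Python rather than !WordPress's PHP,
 purely to illustrate the logic, and it is not the patch attached to #16881.
 The function name robots_meta_for and the example.com URLs are hypothetical;
 the only things taken from the discussion above are the replytocom query
 parameter and the "noindex,nofollow" value.

```python
from urllib.parse import urlsplit, parse_qs


def robots_meta_for(url: str) -> str:
    """Return a robots meta tag for reply-to-comment URLs, else an empty string.

    Hypothetical helper: a page template would print the result inside <head>.
    """
    # Parse only the query string; keep blank values so "?replytocom=" also matches.
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if "replytocom" in query:
        # Non-canonical reply page: ask crawlers not to index it.
        return '<meta name="robots" content="noindex,nofollow" />'
    # Canonical page: emit nothing and leave default crawler behaviour alone.
    return ""


if __name__ == "__main__":
    print(robots_meta_for("http://example.com/post/?replytocom=42#respond"))
    print(repr(robots_meta_for("http://example.com/post/")))
```

 Note that this only changes what the page says once it has been fetched, so
 it addresses indexing (b) but not crawling (a), as described above.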

 OK, end of brain dump. I think the robots meta tag (see #16881) is such an
 easy win that it makes sense. It solves the indexing problem, and once
 these URLs don't show up anymore when people log into Google Webmaster
 Tools or whatever, I think the visibility of this issue goes way down.
 People assume that Googlebot hitting these URLs is bad, but if it's
 happening infrequently and at 3 a.m., does it really matter?

-- 
Ticket URL: <http://core.trac.wordpress.org/ticket/10550#comment:15>
WordPress Trac <http://core.trac.wordpress.org/>
WordPress blogging software

