Robots.txt disallow
Eric Enge of Stone Temple Consulting recently interviewed Matt Cutts of Google. In the interview, Matt discussed how Google handles disallows in a robots.txt file. The comment I found most interesting was this:
“…robots.txt says you are not allowed to crawl a page, and Google therefore does not crawl pages that are forbidden in robots.txt. However, they can accrue PageRank, and they can be returned in our search results.”
Maybe this is common knowledge among the SEO old guard, but I found it quite insightful. Technically, there really isn’t a binding standard that the robots.txt file has to conform to. There is a robots.txt RFC, but even it states that robots can basically do whatever they want, and there isn’t a governing body saying that robots.txt has to meet certain criteria. Still, it is generally accepted that a “legitimate” crawler will abide by whatever a webmaster defines in their robots.txt. For example, I might say:
User-agent: *
Disallow: /privatestuff/
Taken at face value, this says that no spider is allowed to visit my privatestuff directory. It is perhaps a false but natural extension to assume that /privatestuff therefore cannot be viewed by anyone I, as the webmaster, don’t want viewing it. But Matt is saying that not only could pages within that directory appear on Google (to be viewed by virtually anyone), but that those pages could also accrue PageRank!
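To make the distinction concrete, here is a minimal sketch of what a well-behaved crawler actually checks. It uses Python’s standard urllib.robotparser; the robots.txt text, the "ExampleBot" user agent, and the example.com URLs are all made up for illustration. Note that the check only answers "may I fetch this URL?" and says nothing about whether the URL may be listed in search results.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, similar to the rule discussed above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /privatestuff/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" and the example.com URLs are illustrative assumptions.
blocked = is_crawl_allowed(
    ROBOTS_TXT, "ExampleBot", "http://www.example.com/privatestuff/secret.html"
)
allowed = is_crawl_allowed(
    ROBOTS_TXT, "ExampleBot", "http://www.example.com/index.html"
)

print("privatestuff page crawlable:", blocked)
print("home page crawlable:", allowed)
```

A crawler that honors this answer simply never downloads the page. The URL itself can still be discovered through links elsewhere, which is exactly the gap Matt describes.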
What would be the point of using the Disallow directive in robots.txt, then? A webmaster can certainly use it for site-wide crawl restrictions, but if they choose to go that route, any internal linking to those pages could theoretically put them in the SERPs anyway. Not only would the webmaster run the risk of having those pages appear in the SERPs, but it also gives anyone with sinister ideas a URL to scope out. The only way to truly prevent those pages from appearing in the SERPs is to use the robots meta tag (noindex), and a crawler can only see that tag if it is allowed to fetch the page in the first place. Am I wrong here?
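For anyone who wants to audit their own pages, here is a rough sketch of how to check whether a page carries a robots meta tag with noindex. It uses only Python’s standard html.parser; the class name, the helper function, and the sample HTML strings are my own inventions for illustration, and a real audit would also need to fetch the pages.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list such as "noindex, nofollow".
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with a noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Sample pages made up for this example.
PAGE_WITH_NOINDEX = (
    '<html><head><meta name="robots" content="noindex, nofollow"></head>'
    "<body>private</body></html>"
)
PAGE_WITHOUT_NOINDEX = "<html><head><title>public</title></head><body>hello</body></html>"

print("first page has noindex:", has_noindex(PAGE_WITH_NOINDEX))
print("second page has noindex:", has_noindex(PAGE_WITHOUT_NOINDEX))
```

The catch is that the tag lives inside the page, so a page that is also disallowed in robots.txt never gets its noindex read by a compliant crawler.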
There are many legacy sites that have treated the robots file as a hard and fast rule for robots, on the seemingly valid assumption that it will keep everyone out. And I’m sure there are many newbies who view it the same way.
Is this the end of robots.txt as a viable standard? Except for exposing a sitemap to crawlers, I believe so. Your thoughts on this would be appreciated!
BTW, as an interesting aside, the first result for a Google search for “robots.txt disallow” brings up the whitehouse.gov site. Hmmmm…..