Googlebot: Keep out!
Okay, that last post was pretty earnest, so I feel the need to post something really technical now. At SES New York, someone asked “Why don’t you provide a parameter, like ‘?googlebot=nocrawl’ to say ‘Googlebot, don’t index this page’?”
That was a pretty good question. The short answer would be that on pages you don’t want indexed by spiders, you can add this meta tag to the page:
<META NAME=”ROBOTS” CONTENT=”NOINDEX”>
You can read more about the noindex and nofollow meta tags on our webmaster pages.
But the user specifically wanted a url parameter. I mentioned that because the parameter “id” is often used for session IDs, Googlebot used to avoid urls with “?id=(let’s say a five digit or larger number)” but that I didn’t know if that was still true. I think someone else nearby asked “Isn’t that kind of an ugly hack though?” and I had to fall back on “You asked for something that worked, not something that was pretty.” The questioner persisted, but I was out of other ways to do it, so I said I’d pass the feedback on, namely “someone wants a url parameter that’s keeps Googlebot from indexing the page.”
That question came up again today, and I wanted to mention one more way to block Googlebot by using wildcards in robots.txt (Google supports wildcards like ‘*’ in robots.txt). Here’s how:
1. Add the parameter like ‘http://www.mattcutts.com/blog/some-random-post.html?googlebot=nocrawl’ to pages that you don’t want fetched by Googlebot.
2. Add the following to your robots.txt:
User-agent: Googlebot
Disallow: *googlebot=nocrawl
That’s it. We may see links to the pages with the nocrawl parameter, but we won’t crawl them. At most, we would show the url reference (the uncrawled link), but we wouldn’t ever fetch the page.
Obscure note #1: using the ‘googlebot=nocrawl’ technique would not be the preferred method in my mind. Why? Because it might still show ‘googlebot=nocrawl’ urls as uncrawled urls. You might wonder why Google will sometimes return an uncrawled url reference, even if Googlebot was forbidden from crawling that url by a robots.txt file. There’s a pretty good reason for that: back when I started at Google in 2000, several useful websites (eBay, the New York Times, the California DMV) had robots.txt files that forbade any page fetches whatsoever. Now I ask you, what are we supposed to return as a search result when someone does the query [california dmv]? We’d look pretty sad if we didn’t return www.dmv.ca.gov as the first result. But remember: we weren’t allowed to fetch pages from www.dmv.ca.gov at that point. The solution was to show the uncrawled link when we had a high level of confidence that it was the correct link. Sometimes we could even pull a description from the Open Directory Project, so that we could give a lot of info to users even without fetching the page. I’ve fielded questions about Nissan, Metallica, and the Library of Congress where someone believed that Google had crawled a page when in fact it hadn’t; a robots.txt forbade us from crawling, but Google was able to show enough information that someone assumed the page had been crawled. Happily, most major websites (including all the ones I’ve mentioned so far) let Google into more of their pages these days.
That’s why we might show uncrawled urls in response to a query, even if we can’t fetch a url because of robots.txt. So ‘googlebot=nocrawl’ pages might show up as uncrawled. The two preferred ways to have the pages not even show up in Google would be A) to use the “noindex” meta tag that I mentioned above, or B) to use the url removal tool that Google provides. I’ve seen too many people make a mistake with option B and shoot themselves in the foot, so I would recommend just going with the noindex meta tag if you don’t want a page indexed.
Obscure note #2: You might think that the robots.txt that I gave would block a url only if it ends in ‘googlebot=nocrawl’, but in fact Google would match that parameter anywhere in the url. If (for some weird reason), you only wanted to block a url from crawling if ‘googlebot=nocrawl’ was the last thing on the line, you could use the ‘$’ character to signify the end of the line, like this:
User-agent: Googlebot
Disallow: *googlebot=nocrawl$
Using that robots.txt would block the url
http://www.mattcutts.com/blog/somepost.html?googlebot=nocrawl
but not the url
http://www.mattcutts.com/blog/somepost.html?googlebot=nocrawl&option=value.
If you hung on all the way to the end of this post, good for you! You know stuff that most people don’t know about Google now. If you want to try other experiments with robots.txt without any risk at all, use our robots.txt checker built into Sitemaps. It uses the same logic that the real Googlebot uses; that’s how I tested the stuff above.





