How to verify Googlebot

At SES San Jose, a few people asked for a way to authenticate Googlebot, so I talked to some folks on the crawl team and got an official way from them to verify Googlebot. Since it was official, I said, “What the heck, why not throw it up on the official Google webmaster blog?”

And so I did–the post is live here. Enjoy, and hope this is helpful to folks.
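
If you want to script the check described there, here is a minimal PHP sketch of the double lookup (the helper name is mine): do a reverse DNS lookup on the visiting IP, make sure the resulting name is in the googlebot.com domain, then do a forward DNS lookup on that name and confirm it maps back to the same IP.

<?php
// Minimal sketch of the reverse/forward DNS double lookup.
// Note: gethostbyaddr() returns the unmodified IP string on failure.
function is_verified_googlebot($ip)
{
    $host = gethostbyaddr($ip);
    if ($host === false || $host === $ip) {
        return false;                                  // no reverse record
    }
    if (!preg_match('/\.googlebot\.com$/i', $host)) {
        return false;                                  // not in the googlebot.com domain
    }
    return gethostbyname($host) === $ip;               // forward lookup must round-trip
}

// Only bother with the lookups when the User-Agent claims to be Googlebot.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'Googlebot') !== false) {
    $trusted = is_verified_googlebot($_SERVER['REMOTE_ADDR']);
}
?>

Caching the verdict per IP (several commenters below describe doing exactly that) keeps the DNS overhead negligible.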

52 Responses to How to verify Googlebot

  1. I’m sure I can find a use for that, Matt

  2. Matt,

    Thank you, this is a WONDERFUL development.

    One fine detail that was overlooked, and that I would like clarification on, is whether this also works 100% for the AdSense “Mediapartners-Google” crawler.

    A few Mediapartners-Google IPs I spot-checked appeared to resolve correctly to googlebot.com, but I didn’t want to leap to that assumption without official verification.

    Thanks,
    -Bill

  3. Jack from the Netherlands

    I’ve done the following.

    Each webserver runs its own DNS server as well, and I’ve created authoritative reverses for all Google/Yahoo/Inktomi/MSN/other spiders.

    The hostname I’ve chosen for all the IPs is exactly the same for each of the spiders, for example: google.spiders.example.net, msn.spiders.example.net and yahoo.spiders.example.net (*). These hostnames appear in my logs.

    In total, I’m authoritative internally for about 100 x /24 blocks. Each time a new block is used by a crawler, I add it manually to the config.

    It’s not only a great way to make your logs consistent (Microsoft uses non-existent TLDs, which is terrible), it also saves a lot of reverse DNS queries when you have a lot of crawling on your servers (and use real-time hostname lookups). It speeds up the crawling process considerably, just like mod_gzip does.

    Again, these authoritative reverse DNS zones are only visible internally, not for the world.

    (*) where example.net is replaced by one of my core domains.

  4. Jack from the Netherlands

    Btw, for checking whether an IP really belongs to Google, I’d recommend “whois” rather than DNS lookups – partly because new blocks (yes, in the past this has included Google blocks) sometimes don’t have any reverse records from the start.

    $ whois 66.249.66.1

    OrgName: Google Inc.
    OrgID: GOGL
    Address: 1600 Amphitheatre Parkway
    City: Mountain View
    StateProv: CA
    PostalCode: 94043
    Country: US

    NetRange: 66.249.64.0 - 66.249.95.255
    CIDR: 66.249.64.0/19
    NetName: GOOGLE
    NetHandle: NET-66-249-64-0-1
    Parent: NET-66-0-0-0-0
    NetType: Direct Allocation
    NameServer: NS1.GOOGLE.COM
    NameServer: NS2.GOOGLE.COM
    Comment:
    RegDate: 2004-03-05
    Updated: 2004-11-10

    OrgTechHandle: ZG39-ARIN
    OrgTechName: Google Inc.
    OrgTechPhone: +1-650-318-0200
    OrgTechEmail: arin-contact@google.com

    # ARIN WHOIS database, last updated 2006-09-20 19:10
    # Enter ? for additional hints on searching ARIN’s WHOIS database.

  5. This doesn’t seem particularly great; DNS lookups and so on will be slow. Why not, instead, use digital signatures?

    Google could send a digital signature with each request it makes; then we could all confirm that it is, indeed, Googlebot by validating the message with the public key.

    Fast and easy. Doesn’t require slow DNS lookups…
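
    Just to make that idea concrete, here is roughly what the receiving end might look like, assuming Google sent a signature header and published a public key – neither of which exists today, and the header and key file names below are invented:

    <?php
    // Purely hypothetical sketch: X-Googlebot-Signature and googlebot-public.pem are invented.
    // Verifies a signature over the request line against a (nonexistent) published public key.
    $data      = $_SERVER['REQUEST_METHOD'] . ' ' . $_SERVER['REQUEST_URI'];
    $signature = isset($_SERVER['HTTP_X_GOOGLEBOT_SIGNATURE'])
               ? base64_decode($_SERVER['HTTP_X_GOOGLEBOT_SIGNATURE']) : '';
    $pubKey    = openssl_pkey_get_public(file_get_contents('googlebot-public.pem'));
    $isGooglebot = ($signature !== '' && openssl_verify($data, $signature, $pubKey) === 1);
    ?>

    A real scheme would also need to sign something replay-proof (a timestamp or nonce along with the URL), or a captured signature could simply be replayed by anyone.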

  6. What’s the deal with Blogger blogs, Matt? Why do the “links to this post” disappear after a while, such as this

    Links to this post:

    How to verify Googlebot

    on your latest post at Webmaster Central. They’re only there for a limited amount of time and then disappear. Additionally, how do they get there in the first place? Is there a trackback system in place for Blogger?

  7. While I agree that a double lookup is pretty definitive, I’ve also seen Google IPs that don’t have reverse records … so hopefully the crawl team has synced this up with the DNS folks.

    Ummmm … let’s say someone blocks (or shows a “bad spider” page to) robots that don’t appear to be from Google … but if it turns out that it *was* Googlebot, then you could get penalized for cloaking – Arggggh, matey! (Nice captcha addition, Matt.)

  8. @Jonathan
    You have to link to the Blogger post and click on that link, and then your blog post link will show up. (Only click on the post page, not the home page.)

  9. Jack, the purpose of the DNS lookup is to identify Googlebot specifically within Google’s domain, since (as alek mentions) not all Google IPs even return a reverse DNS record. So if you have a Google IP claiming to be Googlebot, you can be sure it’s not somebody spoofing Googlebot through translate.google.com, the Web Accelerator or any other Google proxy-based service.

    Now you have three clues from Google to nail it down: WHOIS to make sure the IP is actually within Google’s range, and then reverse DNS and forward DNS to verify that it’s Googlebot itself on googlebot.com and not someone spoofing through Google’s other proxy-based services.

    Does that make a little more sense now?

  10. Jack, that’s a really neat idea about making your own reverse DNS zones. 🙂 But regarding whois, IncrediBILL makes a good point. IP ranges from whois are great, but that would also include (for example) employees surfing from corporate HQ or fetches from translate.google.com. The method I give is how to verify specifically Googlebots that can fetch for the main web index.

    IncrediBILL, I believe this will also apply to “Mediapartners-Google”, because those bots can fetch through the same crawl caching proxy.

    Jonathan, I have to admit that I’m pretty Blogger illiterate–for now. 🙂

    alek, this should be an invariant for bots that can fetch for the main web index. I’ll try to do a more detailed post that talks about some things like translate.google.com that fetch pages, but not from the same IP range as bots that can populate the main web index.

  11. @Haochi

    Okay, I do that, but why do they disappear?

    @Matt

    Ah, bummer… is there someone I can contact to get some answers about this, or a general contact I can use for any Blogger questions I have?

    Lastly, what anti-spam plugin are you using here? I’m still considering my anti-spam options for my blogs. Also, I hit back after not submitting the anti-spam stuff and it doesn’t hard-refresh the numbers, so I actually had to refresh the page after copying my comment to re-insert it (just FYI).

  12. Jack from the Netherlands

    Matt, don’t you worry, our internal DNS *also* includes those other zones, including some of those L3 nets that never resolve 😉

    Now, re: whois. In fact, in my decision process I’m using all the tools: whois, DNS lookups and Apache UA logging. So regard my posting on using whois as an addition to the Google Webmaster Central post, in which the (IMHO) essential whois check was missing completely.

    (Matt, what t-shirt are you wearing today?)

  13. It seems to me that, to save overhead, one would only want to do a DNS lookup on an IP when “googlebot” is in the UA. When I get time I’ll have to figure out how to do this so that I can add it to my spam-bot detection arsenal.

    Personally I’d still prefer to just know the IP ranges of Googlebot (e.g. they could be listed in Webmaster Central). A last-updated date on the page would let us know if the list has changed.

    Anything that helps us detect fraudulent Googlebots and stop site scrapers not only helps us prevent our content from being usurped, but also helps reduce the amount of search spam Google has to contend with. So this is a welcome development.

  14. Matt,

    I was confused by your “invariant” response – all I was trying to say was to make really, really, really sure that your DNS folks are in sync with the crawler team. I.e., if the latter adds a new group of spiders, one of the checklist items should be to actually confirm that all of the reverse DNS records are in place – since (as per the good suggestion), people will now use this to validate Googlebot.

    Better yet would be a hook in your name server software (I’m familiar with Lucent’s QIP, but you guys probably rolled your own) that *enforced* this behavior at the time of entering the forward record. I.e., if the “A record” = googlebot.com, then the reverse record *must* resolve the same (independent of any other A records that use the same IP address). And don’t let someone later on change the reverse record by accident.

    IMHO, the double lookup would be very difficult to spoof, so it should be adequate by itself. The disadvantage (as pointed out by others) is that DNS is (relatively) slow … and you want to do the validation at the time of spidering (not as part of post-log analysis) … so some sort of caching scheme would be desirable – it sounds like Jack from the Netherlands has one heck of a setup, but more than the vast majority of webmasters would want to do.

    alek

    P.S. I haven’t looked in detail at my logs in a while, but I remember a year or so ago grepping out all the User-Agent=Googlebot requests and then doing reverse lookups on the IPs – I was surprised to see, on my piddly web site, how many people set their UA to Googlebot.

  15. Thanks for thinking about this problem.

    The solution that I’ve always wished for involves a cookie and a sitemap account:
    – Google uses a “secret string” set as a (fake) cookie in Googlebot requests.
    – Google tells me this string through my SiteMaps account.
    – I can test for the cookie in Googlebot requests.

    It is simple, quick and easy to verify on the web server.

    IF cookie = “ads3k1k73” THEN it is Googlebot.

  16. alek, I’m saying that the DNS folks and the crawl folks are tightly in sync. If the backward DNS lookup plus forward DNS lookup method doesn’t work, then it’s a fetch from Google that is *not* from the IP range used for the main web index.

  17. OK – gotcha there – it was just an operational issue that sprang to mind, since if this fell through the cracks it would be a very subtle failure: basically everything would continue to work except for those folks actually doing this test.

    Just to beat the horse truly to death, a simple QA test that Googlers have probably already written is a script that periodically runs internally, snarfs all the spider IPs, does the reverse/forward lookups, and flags any that somehow slipped through the cracks.

    alek

    P.S. Another interesting related approach would be similar to what the DNS-based blacklist (DNSBL) guys do – basically convert an IP address into an “A” record and look that up … only one lookup required – read more at http://en.wikipedia.org/wiki/DNSBL
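
    To make the DNSBL idea concrete: if Google published such a zone for its crawler IPs (as far as I know it doesn’t), the whole check would be a single A-record lookup of the reversed octets under that zone. A sketch, with an invented zone name:

    <?php
    // Hypothetical DNSBL-style check; 'crawlers.googlebot-list.example' is an invented zone.
    $ip       = '66.249.66.1';
    $reversed = implode('.', array_reverse(explode('.', $ip)));   // "1.66.249.66"
    $isListed = checkdnsrr($reversed . '.crawlers.googlebot-list.example.', 'A');
    ?>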

  18. Matt,

    Why does Google not go after the people who use spiders that call themselves Googlebot but actually are not Googlebot? It seems like a trademark type of lawsuit. Maybe if Google started going after these people, it could actually shut them down and put a stop to it on the internet.

    It would be nice if Google would help all the webmasters in that way!!

  19. Sometimes making an example out of one or two can put a stop to a lot!!

  20. Why does Google not go after the people who use spiders that call themselves Googlebot but actually are not Googlebot? It seems like a trademark type of lawsuit. Maybe if Google started going after these people, it could actually shut them down and put a stop to it on the internet.

    Because it would be way too difficult and expensive for Google to prosecute a civil case in Malaysia or Slovenia or Romania or Pakistan or (insert country of your choice here) against some punk kid with about $20 USD to his name officially and $20,000 buried in his mattress.

    Remember, in order to sue successfully, you not only have to win, you have to recover the judgement and costs. That’s not gonna happen here.

  21. Mr. Matt, thank you for informing us. My question is, when will Google declare this publicly? Are you telling us this on behalf of Google?

  22. alek, this was something that the crawl team mentioned that you could rely on, so I wanted to get the word out. But I think we’re open to other ways of doing it down the line.

  23. Well, just speak to some search engine cloners – they seem to know every Google IP and when Googlebot comes… but then again, Google can switch to some other IPs, though they seem to stay up to date!

  24. Chad SEO, I did the post on the Official Google Webmaster Central blog specifically so that people would get the word officially and publicly.

  25. Why does Google not go after the people who use spiders that call themselves Googlebot but actually are not Googlebot?

    Hopefully I won’t make Matt blush with this revelation, but sadly many times it’s ACTUALLY GOOGLEBOT crawling thru a proxy server or worse.

    Here’s an EXTREME example of this that happened to me recently:

    209.73.170.36 - - [05/Sep/2006:14:11:59 -0500] "HEAD / HTTP/1.1" 200 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    209.73.170.36 - - [05/Sep/2006:14:11:59 -0500] "GET / HTTP/1.1" 200 1257 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) (via babelfish.yahoo.com)"

    The reverse DNS of that IP is w21.search.scd.yahoo.com, so it’s really Babelfish – but is it Google?

    So I looked at the extra detail I track, and it claimed it was Googlebot via a Yahoo proxy (“Proxy FORWARD=66.249.65.18”), which was actually crawl-66-249-65-18.googlebot.com going through Babelfish.

    Googlebot doesn’t ask for a HEAD, but Babelfish does ask for the HEAD first [it checks for 404s or redirects, which Babelfish won’t follow] before performing a GET on the page.

    So the whole thing checked out front to back that it was in fact Google translating my webpage via Babelfish and I would sure as heck like to know where that link came from!

    So now you know: many times it’s really Googlebot doing something bizarre, and the new reverse/forward DNS trick lets you easily generate a 403 Forbidden to stop this activity.

  26. Can the bots for Google Sitemaps and the domain verification also be verified in this way? Do you sometimes do “manual checks” that might need to be allowed? There are a few people who would like to keep their sitemap files private – it would be interesting to add “protection” through verification (unless that causes problems on your end).

  27. Hey Matt!

    This is off topic but on my mind lately.

    Lag times.

    Make a change on your site, Google sees the change, Google reports the change in its cache (this appears to take about 2-3 days), Google’s SERPs change (maybe).

    I know this is variable. Maybe you could give three examples of different types of sites and approximate or target lag times for each type.

    Ted Z

  28. TedZ – to reduce your lag time, use Google Sitemaps. There’s a wonderful feature of Sitemaps: you can specify the last-change date of your pages. Combined with “pinging” Google to signal a change in the sitemap file, you have a way to get Google to concentrate on new and changed URLs with very little work (provided you have a good, perhaps automatic, Google Sitemap generator). It works like a charm in my tests.

  29. Hi Matt,

    Greetings from Turkey
    I have a question. I know that it is really off-topic here, but I could not find anything that might help. I do not expect a special answer; a general comment would suffice.

    The problem is this: I have a specialized dictionary site with some 8000 technical terms, with definitions and translations. Now, after the initial cache, Google does not seem to be fetching the content, although a code 200 is issued. And although the database is updated frequently (I am adding more dictionary entries almost daily), Google never re-crawls the pages; it only checks the already existing ones.

    Googlebot only checks the page(s), gets a “200 OK”, and fetches just 5 bytes. I receive literally thousands of hits from Googlebot every day with the same crawl pattern:

    /idioms/5416.html (1 to n.html)
    Http Code: 200 Date: Sep 23 11:54:28 Http Version: HTTP/1.1 Size in Bytes: 5
    Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

    What does this 5-byte crawl mean? Is this another kind of “304”? One for dynamic pages?

    Thank you

  30. This doesn’t seem particularly great; DNS lookups and so on will be slow. Why not, instead, use digital signatures?

    Google could send a digital signature with each request it makes; then we could all confirm that it is, indeed, Googlebot by validating the message with the public key.

    Fast and easy. Doesn’t require slow DNS lookups…

  31. Looks like you have some spam from SEOluv up there…are you going to remove it?

  32. Why wouldn’t you use a signature system? I don’t quite understand why that option would be ignored.

  33. SEOluv: I’m okay, thanks, I don’t view PDFs from untrusted sites (due to security issues).

  34. To avoid double lookups it would be helpful if the IP address resolved to a well-defined pattern, e.g. googlebot-dd-dd-dd-dd.googlebot.com; that way, if the agent is Googlebot and the remote address is dd.dd.dd.dd, only one lookup is needed to validate that this truly is a request from Googlebot.

  35. There is another problem with this method. Since Googlebot doesn’t carry sessions, we’ll have to use this method every time Google hits one of the pages on the website. A step such as this one takes approximately one second to respond and also takes up a little bit of resources. Now take that one second, multiply it by thousands (along with the resources), and we’ve got a small problem.

    For those who are really worried about keeping other bots out of their sites, I say we do this:

    Within our Google Sitemaps account we set a specific field and password that we would like Googlebot to pass to us once it hits our page. Save the IP of the bot that checks in to the database and let it crawl for as long as it wants.

  36. only one lookup is needed to validate that this truly is a request from Googlebot

    OK, I’ll repeat: REVERSE DNS can easily be SPOOFED to say anything.

    The REVERSE DNS will tell you it’s googlebot.com, and the FORWARD DNS will confirm it resolves to the proper IP address, because a forward lookup of googlebot.com can’t easily be spoofed.

    And what’s the big deal about a double lookup?

    You cache the information in a file – let’s say a folder /GOOGLE-CACHE/ where the file names are the IP addresses, or some such – and you only do a double lookup every 24 hours or so, so you aren’t pounding on the DNS except for the first lookup per IP.
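
    A minimal sketch of that caching scheme in PHP (the folder, the one-file-per-IP layout and the 24-hour window are just illustrative):

    <?php
    // Cache one verdict file per IP; repeat the double lookup only when the cached
    // verdict is more than 24 hours old.
    function cached_googlebot_check($ip, $dir = './google-cache')
    {
        $file = $dir . '/' . $ip;
        if (is_file($file) && (time() - filemtime($file)) < 86400) {
            return file_get_contents($file) === '1';            // fresh cached verdict
        }
        $host = gethostbyaddr($ip);                             // the double lookup itself
        $ok   = $host !== false && $host !== $ip
             && preg_match('/\.googlebot\.com$/i', $host)
             && gethostbyname($host) === $ip;
        if (!is_dir($dir)) {
            mkdir($dir, 0755);
        }
        file_put_contents($file, $ok ? '1' : '0');
        return $ok;
    }
    ?>

    Called as cached_googlebot_check($_SERVER['REMOTE_ADDR']), that works out to roughly one pair of DNS lookups per crawler IP per day.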

  37. Matt,
    sorry for the unrelated post, but can you please clarify whether Google indexes an iframe or not? Will its content be considered part of the page the iframe is displayed on?

    thanks,
    iPod 🙂

  38. Matt,
    I’m just going to come out and say it outright. This solution provided by the crawl team seems as though, with a simple if statement, it could be used as a very powerful cloaking tool. I’m sure you guys have thought about this in great detail, but has Google changed its perception of cloaking? I can think of several valid reasons why cloaking could/should be allowable, but there have not been (to my knowledge) any officially approved cloaking techniques for any circumstances. Has Google changed its position on cloaking for the purpose of protecting, say, catalog SKUs?

    Best Regards

  39. Flicked the Big Red Switch over at The Dalles recently?

    209.85.129.*
    209.85.135.*
    209.85.143.*

    🙂

  40. Well said on the last couple posts, IncrediBILL. We try to avoid crawling through things like proxies/mirrors, but it can happen, and you give one way to prevent it from being a problem.

    JohnMu, I’m about to do a post about just that. My guess is that Sitemaps verification doesn’t come from the main Googlebot IP range, but I’m not 100% sure. In general, when a human can initiate a fetch, that fetch rarely comes from the official Googlebot IP range.

    Mike, we could offer digital signatures down the road (maybe), but I found out about this way to verify things, so I wanted to get it down.

    walkman, I don’t believe we index iframe contents, but I’m not 100% sure.

    J-man, nope, we haven’t changed our policy on cloaking at all.

  41. Thanks for telling us – I was still wondering about that one!

    Akash

  42. Not every Googlebot can be verified like this. See my blog post at http://lunchpauze.blogspot.com/2006/09/googles-evil-crawler.html – it shows how we caught a misbehaving crawler from Google that pretends to be Internet Explorer 6 on Windows XP.
    Matt, can you explain it?

  43. Walkman,
    I have a site that used iframes, and no, it did not do well at all – the cache of the home page had no content in it. I have since dumped iframes; I had no luck making them SE-friendly for any of the engines.

    Best Regards,
    J-Man

  44. Thanks, guys. I am still not sure on iframes; I guess I’ll still wait. My bread-and-butter site is out of G’s favor, and I am afraid that constantly rotating content is the issue. For each page I had a “related products” section that picked X random ones from the same category, but after listening to Matt’s interview, I think a page that changes each time it is loaded does not send a good signal to Google. Oh well… wait and see.

  45. @Robin – Why would you accuse Google of being a stealth crawler @ McColo? Most people block McColo because of all the bad bot activity, and it’s NOT Google. Sheesh.

    Lots of ways it could happen – even a proxy server feeding Google a cloaked link to your site (read my posts above) – but they aren’t crawling from those IPs, or we would all know about it.

    Besides, if Google wanted to check whether a web site was cloaking data, they would only have to spot-check a few pages, not do a full crawl; it wouldn’t be efficient to do a complete crawl to establish whether a site was cloaking.

  46. Yesterday we spotted a second crawler from this IP range on one of our sites. I’m not cloaking on my sites, so I don’t think that is the bot’s intention.
    I’ve read your posts above, but I still can’t see how pages appear in the Google cache with that IP address (“Your IP: xxx.xxx.xxx.xxx”). I did a full port scan on that host, but no proxies are running. And even if there were an open proxy running, the user-agent string would contain “Googlebot”, and the IP address would be listed in the open proxy databases. (Or else it would be a really odd proxy, to replace the user-agent string with IE6/WinXP.)

  47. Hi Matt,

    I instituted the double lookup and am now banning all bots that don’t pass the test. I run an e-commerce site with thousands of products and have been the victim of scrapers who copy and repost my products on their pages. My question is this: within two days of implementing the bot traps, all but a handful of our pages have gone supplemental or disappeared, and I’m wondering whether the bot traps may be the cause or it’s purely coincidence. The majority of the SERPs remaining are old links to pages that 301 to new “friendly” URLs. Googlebot is still visiting each day and scans thousands of pages each visit, but it recently started hitting old URLs that were 301’d months ago. Any advice would be greatly appreciated.

  48. And even if there were an open proxy running, the user-agent string would contain “Googlebot”, and the IP address would be listed in the open proxy databases.

    Robin, not all proxies are just open ports like you’re attempting to find. Some are CGI or PHP proxy sites, and they don’t all pass the user agent through – or you can configure the user agent. Some CGI proxy sites, such as Proxify, will by default tell the world you’re using Safari on a Mac, and so on.

    It’s probably someone just running a stealth crawler, not Google (I’m still wondering how you concluded it was Google). But armed just with the IP address you reference: ARIN.NET claims it’s in a block of IPs assigned to CMP Media LLC, and it’s also a banned IP on the TWiki blacklist:
    http://twiki.pula.org/cgi-bin/twiki/view/TWiki/BlackListPlugin

    Pretty sure it’s not Google 😉

  49. Matt, thanks for providing clarity on these questions.

    We do use WHOIS as the first pass; for one thing, it’s more reliable than DNS lookups. For anyone who thinks this is all meaningless: on one site I’m working with, more than 95% of the “Googlebot” visits on a daily basis fail the WHOIS test.

    So we really only need to fool with DNS lookups on the 5% that pass the WHOIS test.

    The process (generalized, since Googlebot isn’t the only spider) is:
    1) We have a user-agent identifying itself as a spider
    2) Check the IP vs our database of good and bad IPs for that spider
    3) Use WHOIS for anything that we can’t look up
    4) Use DNS to verify those that pass the WHOIS test, unless it’s MSN
    4a) (if MSNBot), cross your fingers and hope you didn’t block the real bot

    Robin, don’t assume that those proxies would appear in any published database, respond on a predictable port, etc. If it is a public proxy, don’t assume that the IP making the request is the same as the public proxy address. Some of the biggest public proxies run multiple servers, and the IP fetching pages is not the public WWW server.
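
    For anyone wondering what that WHOIS first pass might look like in code, here is a rough sketch that queries ARIN directly on port 43 and just looks for Google’s org name in the raw response (a real implementation would follow referrals to the other registries, handle timeouts and cache results):

    <?php
    // Rough sketch of a WHOIS first pass against ARIN. No referral-following or caching.
    function whois_says_google($ip)
    {
        $fp = @fsockopen('whois.arin.net', 43, $errno, $errstr, 10);
        if (!$fp) {
            return false;                          // treat a failed lookup as "not verified"
        }
        fwrite($fp, $ip . "\r\n");                 // query is just the bare IP
        $response = '';
        while (!feof($fp)) {
            $response .= fgets($fp, 1024);
        }
        fclose($fp);
        return preg_match('/OrgName:\s*Google/i', $response) === 1;
    }
    ?>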

  50. Dear colleagues,

    From now on, you can check/verify Googlebot IPs very easily.
    Here you can verify IP addresses:
    http://www.2dwebdesign.nl/googlebotchecker.php

    Notice that this tool makes use of the reverse DNS technique which Google has recommended.
    This tool also distinguishes all Google IPs from Googlebot IPs.

    Regards,
    Houman

  51. I thought a lot before placing this on a dead and buried blog entry, but here goes. I guess someone might benefit from it. If it’s too late, just drop the comment 🙂

    With all the buzz about http://www.tellinya.com/read/2007/09/09/defend-your-website-against-google-proxy-hijacking/ I put my previously coded PHP class to good use, and it now filters my website traffic. It works very well.

    PS: I am curious – once an IP verifies as Googlebot, would it be safe to consider the entire class C as Google’s?

    Thanks.

  52. Is there any way to detect Googlebot using PHP?
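
    One way, sketched with PHP’s built-in gethostbyaddr()/gethostbyname() functions – the same user-agent check plus reverse/forward DNS double lookup discussed throughout this thread:

    <?php
    // Inline version of the reverse/forward DNS check for a request claiming to be Googlebot.
    $ip   = $_SERVER['REMOTE_ADDR'];
    $ua   = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $host = gethostbyaddr($ip);
    $isGooglebot = stripos($ua, 'Googlebot') !== false
                && $host !== false && $host !== $ip
                && preg_match('/\.googlebot\.com$/i', $host)
                && gethostbyname($host) === $ip;
    ?>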
