Many people know about Googlebot, which is the official bot that our main web index uses. Recently, I did a post on the official Google webmaster blog about how to verify Googlebot. Now I want to fill in the picture for people who study bots obsessively. 🙂 Check out this picture that I made in under a minute:
Lots of IP (internet protocol) addresses belong to Google. The Googlebot range (shown on the left) is what I’ve already posted about. That range is used for things like the main web crawl and fetches for Google News and our image search. In order to crawl from the same IP range as Googlebot, a system must respect robots.txt and our internal hostload conventions. That keeps Googlebot from crawling sites too hard. But not every page request from Google’s IP addresses is from Googlebot. For example, when I surf the web at Google, those requests also come from a Google IP address. So let’s talk about the right side of the picture: fetches that are not from the same IP range as Googlebot.
There are a few Google fetching systems that don’t fetch from the IP range that Googlebot uses. Most of the time it’s because a person can initiate a fetch (when a person is requesting a page, it doesn’t make sense to abide by robots.txt, which is designed for machines). Here are a few possible examples that I know of:
– Google’s Language Translation tools come from a different IP.
– Feedfetcher grabs feeds for Google Reader and personalized home pages.
– People using the Google Web Accelerator come from a different IP.
– Fetches from the Google Wireless Transcoder come from a different IP.
– AdsBot-Google comes from a different IP.
– The Google AdWords Keyword Tool comes from a different IP.
– Visits from Google employees just surfing the web come from a different IP.
The systems mentioned above probably won’t pass the “backward DNS lookup + forward DNS lookup” technique for verifying the main web crawler bots. For fetches coming from the right side of the diagram, you could instead check whether they come from an IP address that’s registered to Google. To see the owner of an IP address, remember that you can enter this at the command line:
telnet whois.arin.net 43
and then enter the IP address to get more information about that subnet. If you have questions, I’ve already answered a few here.
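If you would rather script these checks than type them by hand, here is a rough Python sketch of both ideas from this post: the backward-then-forward DNS check for the main crawler, and the whois lookup over port 43 for everything else. Treat it as an illustration under assumptions, not an official tool. The function names are made up for this example, the `.googlebot.com` hostname suffix is an assumption you should confirm against the webmaster blog post, and the `OrgName` field name reflects how ARIN replies are usually laid out and may not match other registries.

```python
import socket


def passes_reverse_forward_dns(ip, allowed_suffixes=(".googlebot.com",)):
    """Backward (reverse) DNS lookup, then forward lookup to confirm the IP.

    allowed_suffixes is an assumption for this sketch; adjust as needed.
    """
    try:
        host, _aliases, _addrs = socket.gethostbyaddr(ip)
    except OSError:
        return False  # no reverse DNS entry, or the lookup failed
    if not host.lower().rstrip(".").endswith(tuple(allowed_suffixes)):
        return False
    try:
        _name, _aliases, addrs = socket.gethostbyname_ex(host)
    except OSError:
        return False
    # The forward lookup must lead back to the IP we started with.
    return ip in addrs


def whois_query(ip, server="whois.arin.net", port=43, timeout=10.0):
    """Same idea as the telnet command: connect, send the IP, read the reply."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall((ip + "\r\n").encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def extract_field(whois_text, field="OrgName"):
    """Collect the values of every 'Field: value' line (case-insensitive)."""
    prefix = field.lower() + ":"
    values = []
    for line in whois_text.splitlines():
        if line.lower().startswith(prefix):
            values.append(line[len(prefix):].strip())
    return values
```

To use it, take an IP address from your logs, call `passes_reverse_forward_dns(ip)` first, and if that returns `False`, run `extract_field(whois_query(ip))` to see which organization the address is registered to. Both network functions depend on your resolver and on the whois server being reachable, so a `False` or an empty list means "could not confirm", not "definitely not Google".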