Other Google fetching

Many people know about Googlebot, which is the official bot that our main web index uses. Recently, I did a post on the official Google webmaster blog about how to verify Googlebot. Now I want to fill in the picture for people who study bots obsessively. 🙂 Check out this picture that I made in under a minute:

[Diagram: Google IP ranges, with the Googlebot IP range on the left and other Google IPs on the right]

Lots of IP (internet protocol) addresses belong to Google. The Googlebot range (shown on the left) is what I’ve already posted about. That range is used for things like the main web crawl and fetches for Google News and our image search. In order to crawl from the same IP range as Googlebot, a system must respect robots.txt and our internal hostload conventions. That keeps Googlebot from crawling sites too hard. But not every page request from Google’s IP addresses is from Googlebot. For example, when I surf the web at Google, those requests also come from a Google IP address. So let’s talk about the right side of the picture: fetches that are not from the same IP range as Googlebot.

There are a few Google fetching systems that don’t fetch from the IP range that Googlebot uses. Most of the time it’s because a person can initiate a fetch (when a person is requesting a page, it doesn’t make sense to abide by robots.txt, which is designed for machines). Here are a few possible examples that I know of:
– Google’s Language Translation tools come from a different IP.
– Feedfetcher grabs feeds for Google Reader and personalized home pages.
– People using the Google Web Accelerator come from a different IP.
– Fetches from the Google Wireless Transcoder come from a different IP.
– AdsBot-Google comes from a different IP.
– The Google AdWords Keyword Tool comes from a different IP.
– Visits from Google employees just surfing the web.

The systems mentioned above probably won’t pass the “backward DNS lookup + forward DNS lookup” technique for verifying the main web crawler bots. For these fetches coming from the right side of the diagram, you could check whether they’re coming from an IP address that’s registered to Google. To see the owner of an IP address, remember that you can run

telnet whois.arin.net 43

and then enter the IP address to get more information about that subnet. If you have questions, I’ve already answered a few here.
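
If you want to automate the first check, here is a minimal sketch of the backward-plus-forward DNS verification in Python. It assumes (per the earlier Googlebot-verification post) that genuine crawler hosts resolve to googlebot.com or google.com names; substitute an IP address from your own logs for the example one.

import socket

def looks_like_googlebot(ip):
    """Backward (reverse) DNS lookup, domain check, then forward DNS lookup."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)                # reverse lookup: IP -> hostname
    except socket.herror:
        return False                                         # no PTR record at all
    if not host.endswith((".googlebot.com", ".google.com")):
        return False                                         # hostname isn't in a Google domain
    try:
        _, _, forward_ips = socket.gethostbyname_ex(host)    # forward lookup: hostname -> IPs
    except socket.gaierror:
        return False
    return ip in forward_ips                                 # must round-trip to the same IP

# Example: substitute an address pulled from your own access logs.
print(looks_like_googlebot("66.249.65.202"))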

27 Responses to Other Google fetching

  1. Interesting! It never occurred to me that Google engineers might themselves account for some of the traffic coming out of Google! I guess I just always assumed you guys were working so hard you didn’t have time to surf… 🙂

  2. Matt, one thing of interest: with what USER-AGENT are you surfing around? 🙂 And please keep on surfing.

  3. I can see no IPs in that image.

  4. Out of curiosity, what goes into the space in the diagram inside of “Google IPs” but outside of “Googlebot IPs” and “Other IPs?”

  5. Is there a correlation between crawler IP and datacenter? Or is the crawler now completely separated from the datacenters (with the proxy layer)? Are there any plans to add any support for content-negotiation (like for "MultiViews", or can I tell them to stuff it whenever they ask for help with indexing 🙂 )?

  6. Hi Matt,

    Is there a possibility that blocking heavy traffic from Google IPs that do not pass the “backward DNS lookup + forward DNS lookup” technique would result in any sort of penalty or filter? I have captchas in place, but am wary that I might block wanted automated traffic. Thanks in advance, and I appreciate all your hard work.

  7. There is also a “Google-Sitemaps” bot which looks for a sitemaps authorization file. I first saw it in August 2006.

  8. Yeah… I always wanted to know what actually goes on inside Google's bots.

    Thanks for sharing, dude.

    Manish

  9. sandossu, there aren’t meant to be any IPs in it. It’s just to show that fetches from Google IPs can be from Googlebot, but they can also be from (for example) Googlers surfing the web. James, realistically I should have drawn a line down the middle to divide the space completely. But I made it in a minute. 🙂

    Roger Balmer, I usually surf around with a normal user-agent. Except when I don’t, of course. 🙂

    Yup Sean. Since it fetches in response to a person saying “Okay Google, come validate/check my sitemaps file,” I wouldn’t be surprised if it came from the non-Googlebot set of IPs.

    JohnMu, I think most crawling happens from one datacenter, and then those results come back, are indexed, and flow into the index at every data center. I don’t think MultiView-type stuff is planned right now.

    Joe, there would be a few things (e.g. Google Reader) that you might block. That’s your call–I just wanted everyone to have as much information as possible for making decisions.

  10. Google employees are an extremely tiny fraction of the people surfing the web sites in the Google search engine.
    lol.

  11. Hi Matt,

    (Sorry, re-asking a question I made against an older entry…)

    In your efforts to save bandwidth/trees/whales/etc, do you also make use of the Content-MD5 (and Content-Length) header to avoid fetching mirrored data? I happen to provide this for my (often large) multimedia exhibits, and it would be nice if G could realise/guess that, with the same URL suffix, length, last-mod date and domain, it does not need to fetch the mirrored content at all. This would mean you could still direct local users to their local mirrored copy but save us all lots of bandwidth/trees/karma/etc. (Obviously you’d want to verify the content on a random sample to avoid black-hattery…)

    Rgds

    Damon

  12. Matt,

    Ok, resisting the urge to comment on the fact that you like to pick on people with OCD who can also read raw access logs… you and I both know the whole reason you posted this is so that if your bosses mess up you can now put this post on http://computer.graphic.artist.jobs.com/ as your resume. 🙂

    -Michael

  13. Matt,

    You guys need to quit tinkering and fix Google. I’m seeing more and more legit non-duplicate pages go supplemental on larger sites. Smaller sites don’t stand a chance anymore.

    I miss the Google of a few years ago.

    Lately, it seems the Google sphere is cracking.

    And this is coming from someone who drives around with a Google License plate frame on his car!

  14. Thanks Matt,

    Can you publish a taxonomy of Google user agents sometime, or point me to one if it already exists? I found requests from the following UA in my logs this week, for example, verified as coming from Google. I would love to understand more about how our two companies are interacting:

    ‘Mozilla (Google Web Accelerator Cache Warmer; Google Desktop index update in progress[$~ESTIMATE~$])’

    — Chris

    DamonHD, if you’re talking about ETag, I don’t think we use that. But we do make use of If-Modified-Since with web servers, so if you can set that up, it will save you some bandwidth (a rough sketch of the server side follows after this comment).

    spamhound, got e.g. a blog post with specifics you could point me toward?
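
To make the If-Modified-Since suggestion concrete, here is a rough sketch of the server side: a tiny WSGI app (the file name and port are made-up examples) that answers 304 Not Modified when the file hasn't changed since the date the crawler sends.

import os
from email.utils import formatdate, parsedate_to_datetime
from wsgiref.simple_server import make_server

DOC = "index.html"  # hypothetical static file served by this sketch

def app(environ, start_response):
    mtime = os.path.getmtime(DOC)
    last_modified = formatdate(mtime, usegmt=True)
    ims = environ.get("HTTP_IF_MODIFIED_SINCE")
    if ims:
        try:
            if int(mtime) <= int(parsedate_to_datetime(ims).timestamp()):
                # Nothing changed since the client's copy: headers only, no body.
                start_response("304 Not Modified", [("Last-Modified", last_modified)])
                return [b""]
        except (TypeError, ValueError):
            pass  # unparsable date: fall through and send the full response
    with open(DOC, "rb") as f:
        body = f.read()
    start_response("200 OK", [("Content-Type", "text/html"),
                              ("Last-Modified", last_modified),
                              ("Content-Length", str(len(body)))])
    return [body]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()  # port 8000 is arbitrary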

  16. Man, have I been obsessed with bots and Google IP addresses this week. First off, I installed a mod_rewrite rule to block all Java and Python bots except legitimate known Google and Yahoo Java-based applications. I hope I have allowed all the right IP addresses for Google’s Java applications. I’m watching my error logs for problems so I can allow additional Google IPs if needed.

    Info on how to kill Java scraper bots can be found at WebmasterWorld here: http://www.webmasterworld.com/rv4.cgi?f=92&d=3111150&url=http://www.webmasterworld.com/forum92/4323.htm and http://www.webmasterworld.com/apache/3111150.htm
    The routine positively kills all bad Java bots better than Roundup.

    I have a great topic for you using M&M’s, jellybeans, or sushi: removing content from Google’s index. If you use the robots.txt file to disallow all dynamic content after it has been indexed, how long does it take Google to remove it? I realize it depends in part on how often the URLs are crawled again, but excluding crawl dates, can you give us a clue? (A small robots.txt sketch follows after this comment.)

    Maybe you could do a video on the whole topic of content removal. You could use sardines and let your cat eat them one at a time, or a whole bowl at once for emergency removal!
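
On the robots.txt side of that question, here is a small sketch using Python's standard robot parser; the /cgi-bin/ path is just a stand-in for wherever the dynamic content actually lives.

from urllib.robotparser import RobotFileParser

# A made-up robots.txt that disallows a directory of dynamic pages.
rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
]

rp = RobotFileParser()
rp.parse(rules)

# Dynamic URLs under the disallowed path are blocked; static pages are not.
print(rp.can_fetch("Googlebot", "http://example.com/cgi-bin/page?id=42"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/about.html"))          # True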

  17. I posted bad URLs for stopping Java scrapers. Here they are again:
    http://www.webmasterworld.com/apache/3111150.htm

    http://www.webmasterworld.com/forum92/4323.htm

    My bad.

  18. Hi Matt,

    I do use If-Modified-Since, but Content-MD5 is a potential way for you to avoid ever downloading the first *redundant* copy of (say) an image…

    Get your folks to look at the Content-MD5 header definition (a rough sketch of serving it follows after this comment).

    I don’t use ETag (knowingly); heck, I don’t even understand it (yet) after ~10 years of serving Web pages!

    Rgds

    Damon
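
For anyone curious, here is a rough sketch of the serving side of Damon's idea: Content-MD5 (RFC 1864) is the base64-encoded MD5 digest of the response body, so a crawler that trusted it could in principle skip re-fetching byte-identical mirrored files. The file name below is a placeholder.

import base64
import hashlib

def content_md5(path):
    """Return the RFC 1864 Content-MD5 value for a file: base64 of its MD5 digest."""
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).digest()
    return base64.b64encode(digest).decode("ascii")

# e.g. emit this alongside Content-Length in the response headers
# for a large multimedia exhibit:
print("Content-MD5:", content_md5("exhibit.jpg"))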

  19. Hi Matt,

    May I ask a quick question?
    I operate a Canadian business. Can I register a .com domain and use a DNS and web host in the USA?
    If I do that, will it harm my ranking on google.ca?

    Leeko

  20. It is a pity that there is no list of Googlebot IP addresses… That would be very… mmm… good 🙂

  21. Hello Folks,
    Could anyone tell me how Google finds new website names?
    Let's say there is a new website with a brand new IP address in Kazakhstan. How does Googlebot find out about that?
    Does it take the whole .kz IP range and do reverse DNS lookups, or does it simply have access to the DNS database of the Kazakh name registrar? Or does it have its own BIND server?
    Sincerely,
    Az

  22. I think this is the correct place to bring this up…

    We have several clients’ sites on which our stats (visitor analysis) pages showed up in Google’s index. I submitted a sitemap (our script is set to crawl the web pages, not the filesystem) and made sure that those pages/folders were not included. I also password-protected the pages. They still show up in the Google Webmaster Tools (Sitemaps) section as an HTTP error (401).

    I had to explain this first, but my real question is…

    Does the Google Toolbar use personal information or something? Those pages are definitely not linked from anywhere. The only way they could get into Google’s index is if the Google Toolbar or some other Google application was sending info from the address bar back to Google. The only time I ever go to those pages is by manually typing in the URL.

    Love Google’s services, by the way. Not trying to expose anything… just very curious.

    Thanks,

    Tim

  23. Google is crawling my site continuously: yesterday there were 16k hits, of which 9k were from one IP, 66.249.65.202, with the user-agent Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html).

  24. Found your page/blog via the AltaVista SE when searching for
    “Google Web Accelerator Cache Warmer”.
    I found in my log the “full UA” as
    “Mozilla (Google Web Accelerator Cache Warmer; Google-TR-4-GT)”
    and the IP for that entry was 65.124.120.130, which was for some sort of financial business in Mass., US.

    Here are my “thoughts and ideas”.

    Immediately before these multiple lines with that "accelerator" I found that a person in the UK was trying to visit my site. Of course, "one bad apple ruined the whole bunch," so I had the entire 81.x.x.x range for the UK (and Europe) blocked. Likewise, I don't allow almost all of the APNIC group of IPs… too many email harvesters in Asia, thus much spam originates from Asia… the first Received line in emails cannot be forged… and seeing too much bad activity from Asia and Europe got most of those areas blocked. Unfortunate for innocent parties in those areas. (I block many US and most Canadian IPs due to repeated bad activities, whether it's email spam or bad web activity.)

    Nevertheless, I found a UK visitor trying and trying to view my site, of course 403'd. Then after many of those I found the "accelerator" UA entry for this Google tool. The IP was for a business in Mass., USA. My guess is that this Google tool looks for a cached copy of your site sitting on any number of NON-SE CACHES, and if it finds a copy of your site or the wanted page on a cache near to the viewer, then that copy is served to the viewer.

    Good and bad. Here’s the good, first. That user agent is a real person, so a site owner can allow it or disallow. Site owner choice.

    The bad items now. It "in effect" hides the actual viewer's IP, instead showing the IP of the cache site (which may vary). A site owner can decide to allow this anonymous viewing or decide to block it via the user-agent of Google Web Accelerator Cache Warmer. One advantage, though, now that this new user agent is roaming sites: if a site owner monitors their log file (a monthly log file would be best but could be huge), simply search the log for that UA and jot down the IP numbers (a quick sketch of that log scan follows after this comment). That should give a listing of some "unknown" and possibly undesired caches around the country or the world. The ending part of the full UA after the semicolon, I believe, "may vary". Not sure if it is a version number or an indicator of the visitor's origination point, but it might be interesting to note that portion alongside the IP list in the results.

    Now, a good item. If that UA is allowed (at least at first), then a site owner can find many NON-SE CACHES around the country or world. Not knowing how old they may allow their caches to become can be an issue. They might be serving VERY OLD copies of your site to visitors. The option is then to block those IPs, or CIDR groups for those IPs, especially if a CIDR group is for a business. That might stop or block some of the "unknown" bots/crawlers.

    The latter has always been an issue with me. Unknown bots! Especially if it is a bot owned by EDU facilities or by businesses. You never know how they might use your content. And whether they have a name or not, if they don't have full info about their bots at their site, and if they're not a normal SE, then usually I don't want them. They use my bandwidth, they could be infringing on my writings or copyright, and they could be using code of my own design (I only did one "visitor counter" script, so the latter is not a big issue).

    I believe this "info" may be correct about the Google accelerator, but it might take some verification by one or more people. I most definitely "could not" see why someone at a business would view just one page of my site, but I can very well associate it with being a "cache" source, and it was associated with the one site visitor that was blocked.

    Most probably I will allow it for a short term, just to watch some of its activity (does it obey robots.txt or fetch too fast) or maybe I will block it completely.

    Actually, an update… it might be that the "accelerator" tool found that such a Mass., USA site was a super-high-speed source with less routing for getting to my server in Southern California. Nevertheless, it does "hide" the visitor's IP, and if that IP gets my pages it may in fact actually cache some of them.

    It could be an interesting study: who owns those IPs and where are they located? And what is the relationship or info at the end of the UA string, a program version number or an origination indicator of the visitor?

    Best regards,

    (ps please don’t put my email onto ANY lists of any kind and please ensure that it is not “anywhere” in the blog code so that it can’t be harvested…. it “is” a disposable email but I really don’t want to dispose of it. thanks.)
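
As a footnote to the log-scanning idea in the comment above, here is a quick sketch that pulls out every IP that sent the Google Web Accelerator Cache Warmer user-agent; the log path and the combined-log layout (client IP as the first field) are assumptions about the reader's setup.

import re

NEEDLE = "Google Web Accelerator Cache Warmer"
# In the common Apache combined format, the client IP is the first field.
IP_AT_START = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s")

cache_ips = set()
with open("access.log") as log:          # placeholder path to your access log
    for line in log:
        if NEEDLE in line:
            match = IP_AT_START.match(line)
            if match:
                cache_ips.add(match.group(1))

for ip in sorted(cache_ips):
    print(ip)                            # feed these into whois to see who runs each cache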

  25. Won’t Google’s Feedfetcher check the robots.txt at the top level of the site? I do not want Google to fetch any of the 200k+ entries that are available in a feed from me, as it introduces more CPU load on an already taxed system. Trying to dig out the IP ranges to block is not good enough, really.

  26. And to add to my concern, it sounds like it could create a DDoS situation where the source IP is one of the hundreds of Google IPs and there is no easy way to block it.

  27. I’ve made a regexp 😉
    I don’t know if all the ranges are here, but I can say that all of these are Google, and they are the main ranges (a Python version follows below).

    $ip =~ /^((64\.68|66\.249)\.(6[4-9]|[78]\d|9[0-5])|64\.233\.(1[6-8]\d|19[01])|216\.239\.(3[2-9]|[45]\d|6[0-3]))\./

    bye 😉
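
For non-Perl folks, here is the same check ported to Python. The ranges are simply the ones in the commenter's 2006-era pattern, not an authoritative list of Google addresses.

import re

# Same ranges as the Perl pattern above: 64.68.64-95.x, 66.249.64-95.x,
# 64.233.160-191.x and 216.239.32-63.x.
GOOGLE_IP_RE = re.compile(
    r"^((64\.68|66\.249)\.(6[4-9]|[78]\d|9[0-5])"
    r"|64\.233\.(1[6-8]\d|19[01])"
    r"|216\.239\.(3[2-9]|[45]\d|6[0-3]))\."
)

def in_listed_google_range(ip):
    return bool(GOOGLE_IP_RE.match(ip))

print(in_listed_google_range("66.249.65.202"))  # True (the Googlebot IP from comment 23)
print(in_listed_google_range("81.2.3.4"))       # False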
