Several people have noticed content from other Google bots showing up in our main web index and are wondering why and how that happens. I talked about this issue last week at WebmasterWorld Boston, but since people still have questions, I’d like to do a blog post about Google’s crawl caching proxy.
First off, let me explain what a caching proxy is, just to make sure everyone’s on the same page. I’ll use an example from a different context: Internet Service Providers (ISPs) and users. When you surf around the web, you fetch pages via your ISP. Some ISPs cache web pages so that they can serve them to other users who request the same page. For example, if user A requests www.cnn.com, the ISP can deliver that page to user A and cache it. If user B requests www.cnn.com a second later, the ISP can return the cached copy instead of fetching the page again. Lots of ISPs and companies do this to save bandwidth; Squid is a free, widely used web proxy cache that many people have heard of.
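To make that concrete, here’s a minimal sketch of the caching logic in Python. This is a toy illustration, not how Squid or any real ISP proxy is implemented; the URL and the five-minute TTL are just placeholders I picked for the example.

```python
import time
import urllib.request

# A toy caching proxy: the first request for a URL fetches it from the
# origin server; later requests within the TTL are served from the cache.
class CachingProxy:
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.cache = {}  # url -> (fetched_at, body)

    def get(self, url):
        entry = self.cache.get(url)
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]  # cache hit: no request to the origin server
        body = urllib.request.urlopen(url).read()  # cache miss: real fetch
        self.cache[url] = (time.time(), body)
        return body

proxy = CachingProxy()
page_for_user_a = proxy.get("http://www.cnn.com/")  # fetched from the origin
page_for_user_b = proxy.get("http://www.cnn.com/")  # served from the cache
```

The second call never touches CNN’s servers, which is exactly the bandwidth saving the ISP is after.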
As part of the Bigdaddy infrastructure switchover, Google has been working on frameworks for smarter crawling, improved canonicalization, and better indexing. On the smarter-crawling front, one of the things we’ve been working on is bandwidth reduction. For example, the pre-Bigdaddy webcrawl Googlebot with user-agent “Googlebot/2.1 (+http://www.google.com/bot.html)” would only sometimes accept gzipped encoding. The newer Bigdaddy Googlebots with user-agent “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)” are much more likely to support gzip encoding, which reduces Googlebot’s bandwidth usage for site owners and webmasters. From my conversations with the crawl/index team, it sounds like there’s a lot of headroom for webmasters to reduce their bandwidth by turning on gzip encoding.
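If you’re curious whether your own server returns gzipped responses to a gzip-capable client, here’s a small Python sketch that checks. The headers involved are standard HTTP, nothing Google-specific, and the URL is a placeholder you’d swap for your own site.

```python
import gzip
import urllib.request

# Request a page the way a gzip-capable crawler would, then report whether
# the server actually compressed the response and how much that saved.
def check_gzip_support(url):
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request) as response:
        body = response.read()
        if response.headers.get("Content-Encoding") == "gzip":
            original = gzip.decompress(body)
            print(f"gzip on: {len(body)} bytes on the wire, "
                  f"{len(original)} bytes uncompressed")
        else:
            print(f"gzip off: {len(body)} bytes on the wire")

check_gzip_support("http://www.example.com/")  # placeholder: use your own URL
```

On a typical HTML page, the compressed size comes out to a fraction of the original, which is where the bandwidth headroom comes from.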
Another way that Bigdaddy saves bandwidth for webmasters is by using a crawl caching proxy. I maxed out my PowerPoint skills to produce an illustration. As a hypothetical example, imagine that you participate in AdSense, that Google fetches your urls for our blog search, and that Google also crawls your pages for its main web index. A typical day might look like this:
In this diagram, Service A could be AdSense and Service N could be blogsearch. As you can see, the site got 11 page fetches from the main indexing Googlebot, 8 fetches from the AdSense bot, and 4 fetches from blogsearch, for a total of 23 page fetches. Now let’s look at how a crawl cache can save bandwidth:
In this example, if the blogsearch crawl or AdSense wants to fetch a page that the web crawl already fetched, it can get the page from the crawl caching proxy instead of hitting the site again. That could reduce the number of page fetches to as few as 11. It works in the other direction too: a page fetched for AdSense could be cached and then returned if the web crawl later requested it.
So the crawl caching proxy works like this: if service X fetches a page, and service Y would later have fetched the exact same page, Google will sometimes serve the page from the caching proxy instead. Joining service X (AdSense, blogsearch, the News crawl, or any other Google service that uses a bot) doesn’t queue up pages to be included in our main web index. Also, note that robots.txt rules still apply to each crawl service individually: if service X was allowed to fetch a page but a robots.txt rule prevents service Y from fetching it, service Y won’t get the page from the caching proxy either. Finally, note that the crawl caching proxy is not the same thing as the cached page you see when clicking the “Cached” link next to web results; those cached pages are only updated when a new page is added to our index. It’s more accurate to think of the crawl caching proxy as a system that sits outside of webcrawl and can sometimes return pages without putting extra load on external sites.
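Here’s a minimal sketch of that per-service rule in Python, with some loud caveats: the URLs are placeholders, the real system is far more involved, and this is just my illustration of the behavior described above. The point is that each service’s robots.txt permissions are checked even when the page is already sitting in the shared cache.

```python
import urllib.request
import urllib.robotparser

# The cache is shared across services, but each service's robots.txt
# permissions are checked before it may read a page, cached or not.
# The site and page URLs here are hypothetical placeholders.
robots = urllib.robotparser.RobotFileParser("http://www.example.com/robots.txt")
robots.read()

cache = {}  # url -> page body, shared by every service

def fetch_for_service(user_agent, url):
    if not robots.can_fetch(user_agent, url):
        return None  # disallowed for this service, cached copy or not
    if url in cache:
        return cache[url]  # allowed, so the shared cache may serve it
    body = urllib.request.urlopen(url).read()  # real fetch, then cache it
    cache[url] = body
    return body

# Service X fills the cache; service Y only gets the page if robots.txt
# permits Y's user-agent, even though the bytes are already cached.
fetch_for_service("Googlebot", "http://www.example.com/page.html")
fetch_for_service("Mediapartners-Google", "http://www.example.com/page.html")
```

If that site’s robots.txt disallowed Mediapartners-Google, the second call would return nothing, cache hit or not.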
As always, participating in AdSense or being in our blogsearch doesn’t get you any “extra” crawling (or ranking) in our web index whatsoever. You don’t get any extra representation in our index, you don’t get crawled or indexed any faster by our webcrawl, and you don’t get any boost in ranking.
This crawl caching proxy was deployed with Bigdaddy, but it was working so smoothly that I didn’t know it was live. That should tell you that this isn’t some sort of webspam cloak-check; the goal here is to reduce crawl bandwidth. Thanks to Greg Boser for noticing this, and thanks to Jensense for noticing that one of our online answers had stale info. The support team has updated that answer.