Crawl caching proxy

Several people have noticed content fetched by other Google bots showing up in our main web index, and are wondering why and how that happens. I talked about this last week at WebmasterWorld Boston, but since people still have questions, I’d like to do a blog post about Google’s crawl caching proxy.

First off, let me explain what a caching proxy is, just to make sure that everyone’s on the same page. I’ll use an example from a different context: Internet Service Providers (ISPs) and users. When you surf around the web, you fetch pages via your ISP. Some ISPs cache web pages and can then serve those cached copies to other users who request the same page. For example, if user A requests www.cnn.com, an ISP can deliver that page to user A and cache it. If user B requests www.cnn.com a second later, the ISP can return the cached copy. Lots of ISPs and companies do this to save bandwidth. Squid, for instance, is a free and widely used web proxy cache that a lot of people have heard of.
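
To make the idea concrete, here’s a minimal sketch in Python – not how any ISP (or Google) actually implements it, just the core “fetch once, serve repeats from the cache” logic, with an in-memory cache and no expiry or validation:

    import urllib.request

    class CachingProxy:
        def __init__(self):
            self._cache = {}  # url -> page bytes

        def fetch(self, url):
            if url in self._cache:
                return self._cache[url]      # cache hit: no extra bandwidth used
            with urllib.request.urlopen(url) as resp:
                page = resp.read()           # cache miss: fetch from the origin site
            self._cache[url] = page
            return page

    proxy = CachingProxy()
    page_for_user_a = proxy.fetch("http://www.example.com/")  # fetched from the site
    page_for_user_b = proxy.fetch("http://www.example.com/")  # served from the cache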

As part of the Bigdaddy infrastructure switchover, Google has been working on frameworks for smarter crawling, improved canonicalization, and better indexing. On the smarter crawling front, one of the things we’ve been working on is bandwidth reduction. For example, the pre-Bigdaddy webcrawl Googlebot with user-agent “Googlebot/2.1 (+http://www.google.com/bot.html)” would only sometimes accept gzipped encoding. The newer Bigdaddy Googlebots with user-agent “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)” are much more likely to support gzip encoding. That reduces Googlebot’s bandwidth usage for site owners and webmasters. From my conversations with the crawl/index team, it sounds like there’s a lot of headroom for webmasters to reduce their bandwidth by turning on gzip encoding.
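
If you want to check whether your server actually hands out compressed pages to clients that ask for them, a quick sketch like this works – it just sends an Accept-Encoding: gzip request header (with a Googlebot-style user-agent purely for illustration) and looks at the Content-Encoding response header:

    import urllib.request

    def serves_gzip(url):
        req = urllib.request.Request(url, headers={
            "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
            "Accept-Encoding": "gzip",   # advertise that we can handle compressed responses
        })
        with urllib.request.urlopen(req) as resp:
            # a properly configured server answers with Content-Encoding: gzip,
            # and falls back to an uncompressed body for clients that don't ask
            return resp.headers.get("Content-Encoding", "").lower() == "gzip"

    print(serves_gzip("http://www.example.com/"))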

Another way that Bigdaddy saves bandwidth for webmasters is by using a crawl caching proxy. I maxed out my PowerPoint skills to produce an illustration. 🙂 As a hypothetical example, imagine that you participate in AdSense, that Google fetches your URLs for our blog search, and that Google also crawls your pages for its main web index. A typical day might look like this:

Page fetches under the old crawl

In this diagram, Service A could be AdSense and Service N could be blogsearch. As you can see, the site got 11 page fetches from the main indexing Googlebot, 8 fetches from the AdSense bot, and 4 fetches from blogsearch, for a total of 23 page fetches. Now let’s look at how a crawl cache can save bandwidth:

A crawl cache is much smarter!

In this example, if the blogsearch crawl or AdSense wants to fetch a page that the web crawl already fetched, it can get it from the crawl caching proxy instead of generating more fetches. That could reduce the number of pages fetched down to as little as 11. In the same way, a page that was fetched for AdSense could be cached and then returned to the web crawl if it requested the same page.

So the crawl caching proxy works like this: if service X fetches a page, and then later service Y would have fetched the exact same page, Google will sometimes use the page from the caching proxy. Joining service X (AdSense, blogsearch, News crawl, any Google service that uses a bot) doesn’t queue up pages to be included in our main web index. Also, note that robots.txt rules still apply to each crawl service appropriately. If service X was allowed to fetch a page, but a robots.txt file prevents service Y from fetching the page, service Y wouldn’t get the page from the caching proxy. Finally, note that the crawl caching proxy is not the same thing as the cached page that you see when clicking on the “Cached” link by web results. Those cached pages are only updated when a new page is added to our index. It’s more accurate to think of the crawl caching proxy as a system that sits outside of webcrawl, and which can sometimes return pages without putting extra load on external sites.
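
In rough pseudocode (a sketch of the behavior described above, not Google’s implementation – it assumes a simple in-memory cache, a robots.txt check on every request, and no freshness rules), the logic looks something like this:

    import urllib.parse
    import urllib.request
    import urllib.robotparser

    class CrawlCachingProxy:
        def __init__(self):
            self._cache = {}  # url -> page bytes, filled by whichever service fetched it first

        def _allowed(self, user_agent, url):
            # robots.txt is evaluated per service: a page cached for service X is
            # never handed to a service Y that robots.txt excludes from that page
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(urllib.parse.urljoin(url, "/robots.txt"))
            rp.read()
            return rp.can_fetch(user_agent, url)

        def fetch(self, url, user_agent):
            if not self._allowed(user_agent, url):
                return None                      # blocked for this service, cached copy or not
            if url not in self._cache:           # the first allowed service pays the bandwidth cost
                req = urllib.request.Request(url, headers={"User-Agent": user_agent})
                with urllib.request.urlopen(req) as resp:
                    self._cache[url] = resp.read()
            return self._cache[url]              # later allowed services reuse the cached copy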

Just as always, participating in AdSense or being in our blogsearch doesn’t get you any “extra” crawling (or ranking) in our web index whatsoever. You don’t get any extra representation in our index, you don’t get crawled/indexed any faster by our webcrawl, and you don’t get any boost in ranking.

This crawl caching proxy was deployed with Bigdaddy, but it was working so smoothly that I didn’t know it was live. 🙂 That should tell you that this isn’t some sort of webspam cloak-check; the goal here is to reduce crawl bandwidth. Thanks to Greg Boser for noticing this, and thanks to Jensense for noticing that one of our online answers had stale info. The support team has updated that answer.

72 Responses to Crawl caching proxy

  1. Ugh. What a lot of text that was. I’m going to get out of the house for a few hours.

    BTW, for the really really interested:
    Shoemoney was the first person I saw to write up what I said at WMW Boston:
    http://www.shoemoney.com/2006/04/18/matt-cutts-confirms-media-bot-crawling-for-big-daddy/
    I noticed that Shoemoney just did a new post too: http://www.shoemoney.com/2006/04/23/googles-bigdaddy-and-the-bots-tests-and-clarification/

    I also noticed Greg had a follow-up post:
    http://google.webguerrilla.com/adsense-bot-part-2/
    And Jen did a follow-up post too:
    http://www.jensense.com/archives/2006/04/matt_cutts_conf.html

    Greg asks why we didn’t mention it was live; as I said in the post, I didn’t realize it was live until this past week. I think the only reason that he noticed was because he was doing different things for different bots (AdSense vs. Googlebot). 🙂 Once Greg noticed it, Jen looked through her server logs and used them to say “this page was fetched by AdSense at this certain time” plus “the cached time/date on this page is the same” to conclude that pages fetched originally by AdSense made it into the index. Then ThomasB noticed that you could search for phpinfo() pages that mention the AdSense mediabot to verify the same thing, which I thought was a smart idea.

    Jen has been covering this at JenSense and also at SEW:
    http://blog.searchenginewatch.com/blog/060417-020635
    was the original article, and
    http://blog.searchenginewatch.com/blog/060419-094701
    is a follow-up. I added the details spelling out that the crawl caching proxy is not the same as our cached pages in our web index because of the postscript in that later post.

    And those are the main mentions that I know of. Hope that helps give a lot of background context and doesn’t bore everyone to death. I’m happy to answer questions (I’d read the background links first so that you’re up to speed on the context), but the high-order point is that being in AdSense doesn’t help you in our web index.

  2. Good work on clearing that one up Matt.

    It would be good if the page:
    http://www.google.com/webmasters/bot.html

    was updated with the new user agent and the new crawling techniques – this would help a lot (perhaps even your image could go up there!). Having the other bots listed there as well would be A+.

  3. Oh and this page needs updating:
    http://www.google.com/webmasters/remove.html

  4. Matt, Thanks for the clarification, but I have just one other question about Adsense possibly affecting page rank.

    I have noticed that if I do a Google Web search for the URL of a site I’m tracking, I will see results in “Find web pages that contain the term [mywebsite.com]” and these results will show pages that contain my AdSense ads. And my AdSense ads obviously link back to the site I’m tracking.

    Wouldn’t more pages showing up in these results mean more links??? Also, why do Google AdSense ads even show up in a search for a URL in the first place?

  5. Extremely helpful, thanks. Bigger sites worry about bandwidth a lot (remember the webmasterworld anti robots policy) and I really think this will help. It also explains the recent Jensense reports of “media partner bot” apparently adding content to the main indices.

  6. Well if saving bandwidth was a primary concern you’d understand me cloaking out my images and other non essential stuff …

  7. Thanks for all the info, interesting stuff.

    Any particular reason why Google crawling has slowed so much recently? A lot of webmasters have been noticing this, and I’ve definitely found it to be the case – since about the start of April Googlebot barely visits any more. Of course, some of this is probably due to the stuff you were mentioning in this post, but the decrease in crawling is more than noticeable. Some people were discussing this, along with apparent decreases in the index size and new content not being spidered, over at WMW http://www.webmasterworld.com/forum30/33940.htm .

    Any particular reason this is happening, and will googlebot be crawling a bit more enthusiastically any time soon?

  8. It’s good to see people keeping Google on their toes.

  9. Thanks for the long post Matt; great meeting you at Webmasterworld last week – I was the one leaving the party who was on the way to the train and you asked me if I was going back to the Hotel.

    I’ve been reading and citing your blog on webmetricsguru.com and it was also helpful talking with Vanessa about a sitemaps problem we’re having with one of my clients, http://www.ctg123.com.

    Thanks,
    Marshall
    Webmetricsguru.com

  10. Looks like Google’s SERPs have to tolerate my spam even though I don’t want them to show my BH sites in their main index. I only allow AdSense bots in. I will now have to cloak based on the user agent. So far I was cloaking based on the IP, but that trick won’t work anymore… Aahh!!

  11. Michael Scott, good suggestions. I’ll ask what the status of those pages is.

    graywolf, remember that Googlebot currently doesn’t load actual images, so they don’t cost any extra bandwidth. You’d only gain the bandwidth of dropping an img tag. So gzip encoding would get you much more.

    Marshall, it was nice meeting you! I wish my plane had left later so I could have chatted with more folks. 🙂

  12. Matt, is Google going to set up the proxy on a different set of IPs, or will all current Googlebot (all services) IPs act as proxies?

  13. Hi Matt,

    Would a “Prefetch” request from Google Accelerator also qualify as a bot hit?

  14. Thanks Matt for that. Perfectly clear.

    Now, this would mean that adding AdSense to pages on my sites that are not being crawled any more is not going to queue them up for crawling. I guessed so, as new pages added to my sites have AdSense on them anyway; the AdSense bot has visited them, but they haven’t been added to the index.

    Something has gone wrong, Matt, in the middle of this switch to new types of crawling. I am seeing pages being crawled less (by any of the bots), and then crawled pages (by Googlebot) not making it to the index. This is a new pattern; some sites have managed to escape it and some have fallen to it. Earlier, it was like Google knew my site update patterns to a T, as if it was working in my office – and now it seems to think I update my site with a couple of pages maybe once a month 🙂

    I hope you could check with the Crawling Department 🙂 to see if they do know about this issue!

    Matt (not Cutts)

  15. I have personally just had to upgrade my hosting account because of Inktomi Slurp – 7.51 GB of traffic compared to only 4.45 GB of human traffic on my forum. Googlebot used only 0.5 GB and MSN 0.6 GB.

    So thanks Google for being friendly on the bandwidth.

    Note: the forum is very SEO friendly, so having only one URL per page means that I can get the best use out of the new technology that Google now has.

  16. gray-wolf, I think I may have the solution for your problem.

    What if you used the robots.txt file to block non-page files (e.g. *.gif, *.jpg, *.js, *.css)? That way, you’re not altering your pages at all and Google will still not index the non-essential stuff.
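
    For instance, something along these lines might work (just a sketch – the * and $ pattern extensions are understood by Googlebot, though not by every crawler):

    User-agent: Googlebot
    Disallow: /*.gif$
    Disallow: /*.jpg$
    Disallow: /*.js$
    Disallow: /*.css$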

  17. I was just thinking of something Matt,

    I haven’t had a chance to test it yet on any of my live sites, but does Googlebot (the new one) still listen to rules given for the user agent “Googlebot” (instead of the long-winded Mozilla one)?

    For example, a LOT of people have specific rules that they want followed where they have just put down the user agent as “Googlebot”, not the Mozilla one… Will this need to be changed on our end?

  18. The Google bot has been indexing very slowly these days (the last month or so)… what’s going on with that?

  19. Hi Matt, thank you for this information, very useful.

    Would it be fair to say that by adding gzip encoding to pages on a site I updated last year, I was unwittingly blocking the Googlebot? Or hampering Googlebot in some way? Could it still be blocking Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)? As you mentioned it is *more likely* to support gzip encoding, but I notice you didn’t say *definitely*.

    Gzip is great for that site as the pages take a while to load otherwise, but if it compromises the crawl it will have to be taken off.

  20. Hey Matt, you gotta learn to use color in your diagrams like I did in this one about the caching 🙂 Thanks for showing everyone that “AdSense push” wasn’t actually what was happening.

    Toby: the gzip encoding is only done when requested. What Matt meant is that not all the crawlers had the ability to request gzip encoding, which meant that even if you enabled it, it wouldn’t always be used. Enabling it is a good thing, don’t disable it….

  21. Hi Eric, thanks for that clarification.

    Looks like I’ll have to find another reason for Google hating that site 🙂
    Must be the redirects… heh heh

    Have a great day.

  22. Hi Matt

    How are you – hope things are good and happy you have a new cat in your house as per your other post today.

    With reference to the improved indexing etc. referenced in the post – as you have probably seen at WMW, there are a number of posts regarding sites that are actually losing pages at the moment, and there are still outstanding canonicalization issues (although these sites seem to have their PR back) – but they are still not ranking/being crawled.

    Where would you say these types of reports fit in with where things stand at the moment regarding the new infrastructure, etc.? Any progress report on these issues at this stage?

    Cheers

    Stephen

  23. All righty then. The crawl caching proxy can have a positive impact on crawl bandwidth.
    Well done.

  24. Matt, I’ve mentioned in various places in the past just how huge a task the spidering of a corpus of 8 billion web pages on a monthly cycle has to be. Of course, there’s a lot of speculation about just how many times larger than 8 billion pages the real corpus of indexed documents is. It is great therefore to read more about ways of making spidering more efficient.

    Matt, has anyone ever made a reasonable estimate of how many HTTP GET requests have to be made a second to keep Google’s index at its current level of ‘freshness’ and comprehensiveness?

    I know that 8 billion pages grabbed once per month requires more than 3,000 GET requests per second, and that with the need to spider many pages far more frequently, the chance of pages being temporarily unreachable, etc, the real figure even for the 8 billion would be several times larger.

    The example you gave shows a ‘technical’ reduction in spidering for the example URL of 50% – but it also shows that the page is still being crawled some 11 times in a month, which if average, would increase that spidering overhead for 8 billion pages to a figure considerably higher than 33,000 pages per second.
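
    (If anyone wants to check those figures, here’s the back-of-the-envelope arithmetic as a quick script, assuming a 30-day month:)

    pages = 8_000_000_000
    seconds_per_month = 30 * 24 * 3600     # 2,592,000 seconds
    print(pages / seconds_per_month)       # ~3,086 GET requests/sec at one fetch per page per month
    print(11 * pages / seconds_per_month)  # ~33,950 GET requests/sec at 11 fetches per page per month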

  25. Matt, I assume the caching process respects cache directives such as “no-cache”?

  26. Hi Matt,

    Love the posts. Really great information – very helpful indeed.

    I have a question, which would be really helpful for you to answer…
    From crawling how long does it take the new Googlebot to put pages in the index?

  27. Hi Matt,

    I second Stephen’s request for more information on why Google is dropping thousands of pages from the webmasters who are highlighting the problem in WMW’s thread: “Pages dropping out of the index …”

    Some are assuming that because their website pages were built off a template, Google is seeing those pages as duplicates of others within the site and is not interpreting the unique content on each template-based page as enough to offset some “filter” believed to have been turned up to the max level.

    Besides the usual conspiracy theories and hysterical posts, I do think it’s an issue that could benefit from some clarification.

    Thank you for your time and any information you can contribute to this problem.

  28. Hello,

    I am with Stephen here. I am not sure what is going on with the crawlers, but there are enough users reporting the same problems on the WMW forum at http://www.webmasterworld.com/forum30/33893-1-10.htm and in other threads on that forum and many other webmaster forums that it should be openly discussed by the Google team with the public.

    I realize, Matt, that you receive all these “comments” on things you have nothing to do with (or maybe you do!), but it’s the best vehicle of communication users have to G. Doesn’t that tell you something? :p

  29. “This crawl caching proxy was deployed with Bigdaddy, but it was working so smoothly that I didn’t know it was live.”

    This is a bit of a shock Matt. Can it be that you are unaware of the thousands of posts on WMW and other forums (not to mention the newsgroups) about millions of web pages being erroneously discarded from the index since Big Daddy?

    Not only are people’s 100% spam-free web pages gradually disappearing from Google’s index, but it is quite clear that there is a crawling issue also. New content is simply not making it into the post-Big-Daddy index at all in many cases. Personally I think it’s a depth issue: new content linked from our home page goes in, anything deeper does not (our home page has a PR of 5).

    Please tell me that someone at Google knows that there are serious problems? You can’t all think everything is going smoothly can you?

  30. There’s one thing I’m not 100% clear about:

    > Also, note that robots.txt rules still apply to each crawl service appropriately. If service X was allowed to fetch a page, but a robots.txt file prevents service Y from fetching the page, service Y wouldn’t get the page from the caching proxy.

    What if the robots.txt file prevents service X from fetching the page but not service Y? Would service X fetch the page and cache it anyway (whilst presumably not indexing it) just in case service Y needed it at a later date? Or would service Y crawl at a later date to cache any pages that service X was disallowed to crawl? (If so, how would service Y know which pages it needed – and would it just end up crawling the whole site again? Does the cache know which pages the bot that provided the cache wasn’t allowed to index?)

    For example, say this was my robots.txt:

    # BOF #

    User-agent: *
    Disallow:

    User-agent: Mediapartners-Google
    Disallow: /not-for-adsense/

    User-agent: Googlebot
    Disallow: /not-for-googlebot/

    # EOF #

    Would the Mediapartners-Google bot fetch and cache any pages in the /not-for-adsense/ directory to save Googlebot from having to fetch them at a later date? Likewise, would Googlebot fetch any pages from the /not-for-googlebot/ directory so that the Mediapartners-Google bot could retrieve them from the cache and index them?

    Could you confirm which approach Google’s bots take and confirm whether adding exclusions to the robots.txt file prevents Google bots from crawling, caching or indexing?

  31. Tony Ruscoe Said:
    What if the robots.txt file prevents service X from fetching the page but not service Y?

    The way I see it, both bots will grab the pages they are allowed to based on robots.txt, and the pages will be cached on the proxy. But before they crawl your site they will check the crawl caching proxy to see if the page is there already, and thus not need to crawl your site for that page.

    Or I could be completely wrong!

    Matt – I have thought about enabling compression on my server but was worried about the overhead. With the new Googlebots having the ability to accept compressed files, would the bandwidth savings outweigh the extra server overhead?

  32. Matt:
    I was at the conference last week and heard you speak about the BigDaddy bot and gzip encoding.

    I just searched around the internet and still do not quite understand how to set this up. Does the bot see that pages are similar and do the encoding itself, or do I have to gzip my files and let the bot know where they are to save all the crawling?

    Thanks,
    Michael

  33. Hi Matt,

    From a white hat standpoint, is it okay with Google if an SEO creates a replica of the client’s site using a proxy server? The replica would be used to test the impact of changes (i.e. HTML headers) internally and via the search engines before migrating them to the main site.

    The duplicate site would be hosted on the SEO’s servers.

    Any guidance you can provide would be much appreciated.

    Thanks,

    Kevin

  34. Matt, sorry to hear about your cat. I remember you saying that you took one to the vet, but I thought everything was ok. …sorry

    Does this new Google cache technology have anything to do with lots of people losing lots of cached pages???

    http://www.webmasterworld.com/forum30/33893-1-10.htm

    I believe this issue started around April 13, but I know Bigdaddy rolled out before that.

  35. Hi Matt,

    I too am concerned about the frequency with which the bots are visiting; for my pages that means just one bot from one IP only. And I share the same feelings that Matt (the other one) has.

    Quoting Matt (not Cutts): “Something has gone wrong Matt, in the middle of this switch to new types of crawling. I am seeing pages being crawled less (by any of the bots), and then crawled pages (by Googlebot) not making it to the index.”

    Could you please have it checked?

    Thanks,

    Tonnie

  36. Yes, the Supplemental Hell problem is still alive for many pages! Please check this Matt – thank you!

  37. This new cache has got to be a prime candidate for the cause of all of the missing-pages/lack-of-crawling/lack-of-indexing bugs, surely?

    Particularly if Google’s right hand didn’t even know what the left hand was doing. If it’s merely a bandwidth saving move then please try switching it off for a couple of weeks to see if your index recovers.

  38. Re the loss of pages … how about this:
    – Google initializes the cache / proxy with known data (perhaps a year old, whenever they started with the tests, ideas)
    – One (or more) of the proxy servers can’t access the web properly (firewall misconfigured, cable forgotten, beer spilt on it, pizza on the connectors, the usual stuff, etc.); they must have many boxes, lots of things can happen 😀
    – When the crawlers demand a page, the proxy returns the known copy (which is old), and can’t update it (access is down)
    – Result: page count goes down to the old state, page / cache content goes back to the old state. -> see postings all over

    But of course nobody will ever admit to spilling that beer 🙂

    Anyway, slightly off topic but still valid, I find it kind of ironic that it’s almost impossible to get Google to remove your unwanted pages from the index (using the robots.txt + meta tags) vs. how hard Google works on removing other pages it doesn’t want to show in the index :D.

    Say, Matt: could this caching proxy server be the reason for the Change-Frequency setting in Google Sitemaps? And how does “Priority” fit in (or does it not fit into this system at all)?

  39. Matt,
    One question — now that Mediabot spidered pages may be also used in building the regular search results index in Google, doesn’t this mean that sites which have AdSense Ads but which have not made their pages spider-friendly for crawling could suddenly expect to see their deep links appearing in the SERPs?

    Some sites use spider-unfriendly navigation, such as just a search box, for navigating their site and content. It would appear that those sites would now be exposing their pages for indexing, if they display AdSense on their pages.

  40. Matt,
    Thanks for your great updates and comments. I want to add to “FreedomSaid’s” posting about the lost pages. I have a template design on my webpage and to my knowledge never violated any of Google’s rules, but about 10 days ago, my entire DOMAIN and all subpages were removed from Google. Googlebot still visits my pages daily though, so I’m assuming (well, make that hoping to God) that I haven’t been officially “banned.” While I watch my business and career disappear I’ve been searching for solutions and honestly have no idea what to do. At the least, Google providing some information online about updates and the like would be most appreciated. In other words, if this is due to a known problem/glitch, let us know so we (I) don’t think it’s something I need to fix (not that I’d really know, but I’m spinning my ignorant wheels trying to find out). Sorry this was so long; I really appreciate your blog and all I’ve learned from it!

  41. It’s funny how all you spammers notice all this stuff in microseconds and act as if someone just stuck a red hot poker in your eye…hehehe

    (Don’t hate me because I am right)

  42. Hi Matt,

    Gr8 news, thanks for this. There are a couple of questions about the crawl caching proxy model ticking over in my brain.

    (a) What if the caching from Googlebot is nil and the caching from different Google services is high – does one still get the advantage of caching in that case or not, and what about the bandwidth consumption?

    (b) Second, I am running an AdSense campaign; as you mentioned, we cannot get any benefit in rankings or caching from that. Is there any hidden penalty behind this???

    Cheers
    Vikasamrohi

  43. Matt et al,

    I can confirm that the problem with the lack of Googlebot crawling is mainly due to its normal caching method. Google is simply relying on the Mediapartners-Google bot to cache pages temporarily (http://blog.searchenginewatch.com/blog/060417-020635). Why do I say temporarily? Read on…

    I am an engineer for a big shopping site and our home page hasn’t been cached since April 15th. Our home page is a PR8 and before the BD update it was being cached at least 1x per day. Is this bad? Yes and no.
    Yes, being that the smaller sites out there will probably be affected the most. Anytime someone like Google has to revert back to a previous trunk or checkins…to correct for error(s) across several data-centers, you’re going to see hiccups across the board. In the interim, to correct for this, you will almost always need to use other means to define what’s temporarily missing. Caching in this case, hence the Mediapartners-Google bot. Even if that means slowing down or pausing the processes by which data is created and/or manipulated.
    Since we are mainly dealing with cache here and we know from Matt’s post above…that “as part of the Bigdaddy infrastructure switchover, Google has been working on frameworks for smarter crawling, improved canonicalization, and better indexing”, Google is switching for the better.
    Important thing to remember. When you switch frameworks, especially caching ones, you’ll probably run into a few unforeseen errors when it goes live. This is usually due to the magnitude and complexity of code you are dealing with. Until the bugs are fixed, you probably won’t push it 100% live.
    No, being that larger sites will probably not be nearly as affected here. This is mainly due to the unknowns that make Google…well ummm…Google. We will just never know this, so let’s not act like we do. What we do know… is that for obvious reasons, such as back-linking, amount of content, PR, etc., it doesn’t make business sense for Google to drop these types of sites out of thin air.

    So in the meantime friends…sit tight. I can almost promise you that Google engineers are working hard at fixing these issues right now and that everything will be resolved real soon. In the meantime…start cranking on those new apps for Web 2.0 😉

    Below is a tip I recommend to protect a site from future canonical or caching updates in any search engine albeit Google:

    1. Use absolute link paths or base paths OVER relative paths whenever possible.

    Example:

    Absolute: You use the entire url pointing to the designated page.

    ex. http://www.yoursite.com/page1/index.html

    Relative: You use a path relative to the current location.

    ex. /page1/index.html

    Does this help your rankings? Absolutely not.

    Will this lessen the pain if a search engine’s caching method goes bad? Most likely yes.

    Illustration:

    If you were trying to remember an out-of-state phone number but forgot the area code (*cough* domain)… which one would you prefer:

    A. 355-0707 ?

    or

    B. 510-355-0707 ?

    I would prefer B any day. I dunno about you, but I don’t have time to find the missing link, in this case an area code. The same goes for your domain. Why would you NOT want your entire site’s taxonomy (hierarchy) cached with an absolute or base URL path?

    Need more? Check out Matt’s post on URL canonicalization and more here: http://www.mattcutts.com/blog/seo-advice-url-canonicalization/

    Hope this answers a few questions and the tip helps a few people out. Take care and God Bless.

    -Jonathan

  44. hello Jonathan

    Well, using absolute paths is a nice way to get rid of the canonical problem. But in my past experience, on the index page one should use relative paths instead of absolute, and for the rest of the inner pages one should use absolute paths to balance the outbound links.
    What do you say? I have a lot of examples – check the Google, HotBot and Miva linking structures.

    Cheers
    Vikasamrohi

  45. pythod, I wouldn’t be surprised if most crawling is from one set of IPs eventually. Key_Master, I think the Google Web Accelerator is completely separate right now.

    Michael Scott, I believe the new Mozilla-ish Googlebot absolutely still obeys “Googlebot” in robots.txt. So no need for changes on your end.

    Toby Sodor, each browser/bot tells the web server if it can fetch gzip encoded docs. A properly configured web server will fall back on uncompressed if a browser doesn’t support gzip. It’s possible that what you describe caused problems for you late last year, but I’d guess not. It would be unusual for a web server not to support fallback to uncompressed.

    Ammon Johns, let’s just say that to build a crawler like Google’s, you have to build an industrial strength crawler with multiple bot machines and smart prioritization. You’re right that it’s a lot of work, and bandwidth.

    Ron, I’d have to ask.

    Adrian, the crawl caching proxy is (in my opinion) completely different from the issue of some people’s sites not being crawled as much in Bigdaddy. I was aware of the latter, but not the former. Regarding the latter, GoogleGuy mentioned that you can email to bostonpubcon2006 at gmail dot com with a subject line of “crawlpages” (all one word) to mention your site. Someone is going to look through that feedback.

    Tony Ruscoe, think of it from the perspective of service X. Service X wants url A. It asks “Am I allowed to crawl url A according to robots.txt?” If the answer is yes, it requests that page. That request can sometimes return a cached page from the crawl caching proxy. But another service Y will never request a url A if service Y is forbidden by robots.txt.

    Neil H, it’s usually worth turning on gzip encoding if you serve a lot of static files. If you have a dynamic site, it might not be worth the CPU hit. Unless you’re on an ISP where you pay for bandwidth, but not for CPU–then you might want to go ahead.

    Michael Manning, normally this is a setting you would enable in the Apache web server, rather than gzipping the files yourself.

  46. Thanks for the clarification Matt. I am such a noob when it comes to some of this stuff. I can understand some of the newer sites I help out with not being crawled and indexed, but the 5 year old site I look after is driving me mad – Google will only continuously crawl/index/cache the same 100 pages and ignore the other 1000+. There is no rhyme nor reason to it. I know you won’t comment on individual sites and rightly so, so I’ll ask a general question if I may:

    How does the new Googlebot handle multiple redirects?

    Say someone wanted to move the entire site up a level and redirected all files about a year ago from http://domain.com/subdirectory/* to http://domain.com/*

    Then about 6 months later to fix canonical problems redirected everything from http://domain.com/*.php to http://www.domain.com/*.php

    And after that to avoid duplicate content penalties redirected http://www.domain.com/index.php/category/*/story/* to http://www.domain.com/index.php/story/* (removing the category parameter)

    So that webmaster is doing up to 3 chained redirects – can Googlebot handle that? If not, how many redirects can Googlebot handle?
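
    (In the meantime, here’s a rough way I can check how many hops a URL actually goes through – just a sketch that follows Location headers by hand, nothing Google-specific:)

    import urllib.parse
    import http.client

    def redirect_chain(url, max_hops=10):
        # follow 3xx Location headers manually so every hop is visible,
        # roughly the way a crawler resolving chained redirects would
        chain = [url]
        for _ in range(max_hops):
            parts = urllib.parse.urlsplit(chain[-1])
            conn_cls = http.client.HTTPSConnection if parts.scheme == "https" else http.client.HTTPConnection
            conn = conn_cls(parts.netloc)
            path = (parts.path or "/") + ("?" + parts.query if parts.query else "")
            conn.request("HEAD", path)
            resp = conn.getresponse()
            location = resp.getheader("Location")
            conn.close()
            if resp.status not in (301, 302, 303, 307, 308) or not location:
                break                      # final destination reached
            chain.append(urllib.parse.urljoin(chain[-1], location))
        return chain

    print(redirect_chain("http://domain.com/subdirectory/index.php"))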

  47. Matt,

    You said
    “Adrian, the crawl caching proxy is (in my opinion) completely different from the issue of some people’s sites not being crawled as much in Bigdaddy. I was aware of the latter, but not the former”

    Any advice/help/news on the ‘former’ (caching problem)?

    Many people have this problem (massive thread on WMW)

    Thanks,

    Pete.

  48. This latest change to the way Google crawls sites is another nail in the coffin for small business websites with low PR and a small number of inbound links. Several of my PR4 sites have been deep-crawled only once this month (April), and the content of Google’s index is rapidly filling up with old cache junk. New sites aren’t getting found and the SERPs show nearly all results with old cache. In addition, many low-PR pages have been mysteriously de-indexed, as it seems Google is increasingly reluctant to deep crawl very low PR sites, preferring to just index the homepage and leave. Just because pages have a low PR doesn’t mean that their content is worthy of the trash can treatment.

    All of this makes me want to give up my web design job, as to be honest, I feel that everything Google unleashes is designed to hurt small business and drive them towards using AdWords. Jagger was the first example, where you trashed any site that looked like it had been SEO’d. This latest Big Daddy disaster means that if I change and update my site’s content, I’ll now have to wait a month to get it re-cached. To illustrate, I put up a new website for a small company a month ago and that hasn’t even been crawled and indexed yet, despite doing an Add URL using Google’s form and adding a few respectable quality PR4 links pointing to it. It even uses AdWords and still hasn’t been indexed!!

    Please tell me this isn’t the way Google is going to operate in the future! I am certain that the stale old SERPS results it will produce won’t do your market share any good at all.

  49. What is really frustrating is how long it’s taking for the crawlers to pick up new content. Even on newer content sites displaying AdSense, I’m still not seeing them showing as indexed at present, or – at best – simply the homepage, but no real deep content picked up as yet.

    It feels as if the Google search index has been very incomplete for large chunks of this year already, with some site content disappearing and then re-appearing again, and newer content just not being picked up, even when the site PR is respectable – but I guess there’s still a lot for Google to deal with in adapting what must be a huge infrastructure to the changeover. Hope it’s not too much of a headache for everyone involved.

  50. But what about those kinds of proxies???

    I have a serious problem with two of my websites, whose results in Google were hijacked by a proxy site called unipeak.net (the same as unipeak.com), and as a result all of my traffic now goes to this site, which is simply stealing my content.

    It all began when I saw that traffic to two of my sites dropped drastically – 8-10 times. I looked into the reason and found that I am losing Google traffic. I checked my rankings and I was pretty surprised to see that this site – http://www.unipeak.net – is ranking right where my sites used to be. It is even ranking #1 for my sites’ domain names!

    Same here – the copy of my page at unipeak.net is ranking where my site was previously ranking.

    This proxy site has also parsed and rewritten all of the links on my page; it’s unbelievable, and really nobody is now coming to my sites as a result of this hijacking. I investigated a lot and found no explanation as to how this site has managed to make Googlebot believe that it is my sites. I found other webmasters complaining of the same theft, and I saw other sites whose pages are hijacked by the same site (unipeak), but no solution to the problem. I tried blocking the IPs of the crawlers of unipeak.net/.com, but it simply didn’t work.

    I am completely at a loss about how to get my sites back into your index. Is there anything I can do to prevent this from happening and/or get my rankings back? Is there anything you can do about this issue (which seems *very* serious to me if any site can be hijacked this way…)?

  51. Thanks for the clarification on images. [See Matt’s comment above dated ‘April 23, 2006 @ 5:46 pm’]

    And here I was foolishly thinking that google was pulling images. Search engines obviously don’t need to ‘see’ images, yet!

    Yay for caching proxies.

  52. Thank you for the useful info.

  53. Did not know how images play a role in search engines until this post. Thank you.

  54. Thanks for the detailed clarification!

  55. Thanks for the diagrams and clarification.

  56. So, that tells me basically that Google has to store a proxy in the cache? How do I get rid of the one fouling things up from the Google Accelerator? I’ve uninstalled the program, yet it’s still in the LAN settings (albeit grayed out).

  57. Hi Matt,

    Is it possible that
    site:http://www.abc.com
    and
    cache:http://www.abc.com
    Both show different cached versions of the home page at the same Google data center. If I look at the caching of the home page through “site:”, it shows the cache of my site, while if I use “cache:siteurl” it shows the cache of another site. I tried the URL with www and without, but the result is the same, and there is no 301 redirect from my site to the other one. But still, I am facing this kind of problem.

  59. Dear Matt,

    Right now I am facing one problem with my site. At the end of July we added some pages to our web site, and after some days, when I tried to see the caching of my inner pages, I was surprised: Google shows the cache of my home page in place of the inner page. Please help – how does this happen, and what’s the reason behind it?

  60. Hi Matt,

    Thank you for clearing up this issue, which was really hard to understand from a webmaster’s point of view. Anyhow, it was clear and detailed information.

    Thanks again.

  61. Hi Matt
    I purchased an expired domain a few weeks ago and put up a web page that shows “Under Maintenance”; this has been live for over a month. The URL is http://www.scryypy.com. Whenever I check the Google cached version it shows me the cache of the website http://designerpad.org/ (http://209.85.135.104/search?sourceid=navclient&ie=UTF-8&rlz=1T4SKPB_enIN235IN236&q=cache:http%3A%2F%2Fwww.scryypy.com%2F)

    Can you help me understand why this is and how I can resolve this problem?

    Thanks

  62. At last I found this information. Thanks a lot. It is a part of my diploma now!! 🙂 Any progress in the “Bigdaddy” field? There is not too much information on the Internet about this. Thanks again.

  63. Hi Matt,

    I have a site_map.htm page on my site which contains links to all of the pages on my site. Google has indexed the page successfully, but when I look at the cached page it is months out of date. Google states that the page is a snapshot from a crawl on December 27th 2005, but this cannot be the case, as the version shown is months old – i.e. I have updated the site map with several new pages but they aren’t appearing in the cached version.

    Any ideas on what the problem is?

    Thanks
    Bishan

    Also, just as a side note, remember it is extremely important to set up a proper robots.txt for your proxy; if you don’t, then your proxy will crawl the web and Google will index other people’s websites under your domain. This may seem great, but it’s pretty unethical if you ask me and causes a lot more problems than the benefits of fresh content are worth.

    Here is the proper format for PHProxy 5

    User-agent: *
    Disallow: /index.php?

    Just put that in your robots.txt file and upload it to the root of your server.

    There are other ones for cgi proxy and glype.

    If you’re worried you can’t do well in the search engines without stealing content, check out my site at http://www.proxybolt.com – I guarantee you I am doing very well in the SEs and have a ton of visitors without having to steal.

  65. Interesting post. I still do have problems with Googlebot crawling my pages after I turn on optimization (gzip compression) for:
    text/html
    text/plain
    text/xml
    text/css
    application/javascript

    You wouldn’t believe it. Nothing is crawled. I’ll look into it and let you know if I find out what the problem is. If anyone here knows about it, yell it out! 🙂

  66. I switched to a VPS and now my job listings aggregator works like a charm. Well, there was a shift from cPanel to Plesk as well, but I don’t think cPanel was directly responsible. It was probably some flood protection…

  67. As a qualified Microsoft MCSE (NT4 and 2000 server series), I know that the original web server product, called Proxy Server, and especially the later editions called ISA (Internet Security and Acceleration) Server 200#, cache many pages for your LAN. So I guess the servers used by ISPs are doing the same. Like your local IE application – how many times do you use F5 to refresh?

  68. Matt’s diagram pretty much matches what we’ve been seeing, although I still think there might be some issues regarding robots.txt, but we’re still collecting some data on that, so I’m going to wait until that’s done before commenting on it.

    Thanks
    David Janes

  69. Hi Matt,

    What is the current status of this? I noticed that for each incoming user request, Mediapartners-Google crawls a page 2 times – just because I have 2 AdSense units on the page!!!

    Such an obvious performance bottleneck… I can use HTTP caching to help Mediapartners – but I’ll have to disable gzipped output (problems with Apache HTTPD)

    Additionally, I have some pages restricted from Googlebot, but enabled for Mediapartners-Google.

  70. I have been looking for this info. Thanks Matt!

  71. Ayush Agrawal

    I don’t know if caching and compression have anything to do with SEO; anyway, even if they did, we thought it would only have a positive impact. However, we have noticed a significant decline in our SERP rankings (the website concerned is FilmiTadka) after we enabled gzip compression and Apache KeepAlive.

    Please advise – am I even thinking in the right direction?

  72. Thanks Matt, I was searching for this. It’s old but informative – any new updates on this?
