Video: Crawl dates in the Google cache

Lots of people know that Google shows the date when we last visited a page when you look at a cached page in Google. For example, the cached page for my blog might look like this:

My cache crawl date

You can see the red oval where I’ve circled the time that Googlebot last fetched my blog’s home page. Google was the first major search engine to start showing the crawl date, and I think at this point every major search engine shows the crawl date on cached pages, except for one. *cough* *cough*, sorry, I’ve been under the weather today.

Yesterday, Vanessa Fox did a great post about how we’re changing the crawl dates that you see at the top of Google cached pages to make them more accurate. (By the way, bonus points if you can spot the leetspeak in Vanessa’s post.) Google uses something called “If-Modified-Since” to use less bandwidth when crawling the web (which is a good thing for site owners). In essence, if a page hasn’t changed since the last time we fetched it, there’s no need to fetch it again. But even if we checked whether a page was unchanged, we didn’t update the crawl date in the cache–we’re changing that now.
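To make the If-Modified-Since exchange concrete, here is a minimal Python sketch of the decision a web server makes when a crawler sends that header. The function name `conditional_response` and its arguments are hypothetical and chosen only for illustration; this is not Googlebot's code or any particular web server's implementation.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(if_modified_since, last_modified):
    """Return (status, headers) for a GET with an optional If-Modified-Since header.

    if_modified_since: raw header value sent by the crawler, or None.
    last_modified: timezone-aware datetime of the page's last change.
    """
    # HTTP dates have one-second resolution and are expressed in GMT/UTC.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # Unparseable header: fall through to a full response.
        if since is not None and since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)  # Treat naive dates as UTC.
        if since is not None and last_modified <= since:
            return 304, headers  # Not Modified: no body needs to be sent.

    return 200, headers  # The full page follows.
```

For example, if the page last changed on 20 August 2006 and the crawler sends `If-Modified-Since: Sun, 27 Aug 2006 13:13:37 GMT`, this sketch answers 304 and no page body crosses the wire. Without the header, or with a date older than the last change, it answers 200.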

In case that sounds complex, I made a video last night that uses candy to illustrate the point. Here you go:

Session 16: Crawl dates in the Google cache

Matt talks about how Googlebot crawls the web, and what “crawl date” is shown on cached pages. In this video:
– Red candy is a 404 page
– Purple candy is a 200 (OK) page
– Green candy is a 304 status code (page has not changed)
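The three status codes above can also be sketched as a tiny model of what a crawler might do to its cache entry after each visit. This is a hypothetical illustration in Python, not how Google's cache is actually implemented: `CacheEntry`, `record_fetch`, and the `update_on_304` flag are invented names, and leaving the entry untouched on a 404 is an assumption the post does not confirm.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class CacheEntry:
    content: Optional[str] = None          # last page body that was stored
    crawl_date: Optional[datetime] = None  # date shown on the cached page


def record_fetch(entry, status, fetched_at, body=None, update_on_304=True):
    """Update a cache entry after one crawl attempt (hypothetical model)."""
    if status == 200:
        # Purple candy: a full page came back, so store it along with the date.
        entry.content = body
        entry.crawl_date = fetched_at
    elif status == 304 and update_on_304:
        # Green candy: the server confirmed the stored copy is current,
        # so only the crawl date moves forward.
        entry.crawl_date = fetched_at
    # Red candy (404) and anything else: this sketch leaves the entry
    # untouched, which is an assumption rather than documented behavior.
    return entry
```

Calling `record_fetch` with `update_on_304=False` models the old behavior, where a 304 left the displayed crawl date alone; the default models the change described in the post.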

If you’re someone who prefers to learn visually, I hope the video helps. 🙂

68 Responses to Video: Crawl dates in the Google cache

  1. So, did you eat all of them after you finished the video? That’s quite a lot of candy 🙂

  2. I think the current version in the Apache 1.3 branch is a little more 1337-er than 13:13:37 🙂

  3. Cool logic Matt! Is there a problem with a page that never shows a 304? For example: A page with dynamic elements that may change hourly, or may change every few days.

  4. As usual, great post Matt. I will be checking the video soon. I did read some of the university research papers on how search engines work, the cache systems, ranking algos, etc. This post just made it clearer with an illustration. One solution to this is adding the current date or some feeds. I have two questions.

    1) Say if a page is not getting updated for last few days, will the frequency of google visit be updated accordingly (less visits)?
    2) Is a small change like date update or feeds, a change enough to avoid a Google 304 message?

  5. Hey Matt,
    It was terrible man! heheheheh. I just could focus on the M&M’s! hehehe
    Thank you for trying. ;o)
    So, when is the end of Summer to Google? Can we expect big changes on SERPs?
    Have a great week Matt!

  6. I’ve still got Skittles left, Nadir. 🙂

    Aaron Shear, if you’ve got a website where a 304 won’t work or doesn’t make sense, I wouldn’t worry about it. It just makes crawling static sites a little easier on the bandwidth for webmasters.

  7. Jorrit, I had to create a valid timestamp; that was the best I could suggest. 🙂

    Florida Guy, there may still be some movement going on till Sept. 30th or so, but I wouldn’t expect it to be a ton of movement.

  8. Damn, did a long comment but didn’t pass math! ;-(

    Anyhow, just wondering if having a sitemap on my 404.php for humans is ok? I do so because I just can’t be bothered with 301’ing everything.

    I also would like to learn how to watch a Googlebot visit my 404.php to see if it leaves or follows the URLs…thanks.

  9. Matt, I’ve been trying to figure out why our website isn’t ranking for a couple of specific key phrases. It appears to rank fine for other key phrases (which are somewhat less important including searching for the domain name). There are no crawl errors or robots.txt errors noted in the webmaster tools. No redirects or doorway pages. There does appear to be some inconsistency when you do a cache:www.siteurl.com – it returns no results. However, when you do a site:www.siteurl.com – it returns every page within the site with pretty recent cache links (most within the last week). Also, link:www.siteurl.com comes up with no results but allinurl: appears to show a few results. Does this indicate a potential problem? Is it possible for a site to be penalized for certain keywords/phrases? Or is it all or nothing? Does cache play a role in the positioning of a site in the search results? Also, a slightly unrelated question, if alternate text is included in a page to accommodate, e.g., users who might not have the Flash plugin installed (i.e. alternate text describing the site – no keyword stuffing etc – within the object tag), is that legit as far as Google is concerned? Thanks for taking the time to educate and maintain this blog.

  10. Matt, you guys still haven’t taken my suggestion and added a link to the real page if the cache result turns up empty, or even a link to the site: results if cache results turn up nothing.

    You know those are one of the 2 places people are going next…

    and I didn’t see the l33tspeak. I even ran it through my Validator and it didn’t find anything other than a “blog” which isn’t really l33t…

  11. skittles! nice visual lesson 🙂 if only my teacher did that in school.

  12. 13:37

  13. I’m just glad Matt didn’t use T-bones marinated in Diana sauce, which would have been a bigtime waste of meat and a crime punishable by the death penalty.

    Mmmm…T-bone in Diana sauce…:P~~~~~~~~

  14. Ryan: It’s in the next to last paragraph. I’ll bold it.

    We’ve recently changed the date we show for the cached page to reflect when Googlebot last accessed it (whether the page had changed or not). This should make it easier for you to determine the most recent date Googlebot visited the page. For instance, in the above example, the cached version of the page would now say “This is G o o g l e’s cache of http://www.example.com/ as retrieved on August 27, 2006 13:13:37 GMT.”

    Cute. Real cute. 😉

  15. Thanks Adam… I wasn’t properly parsing out a : in my validator.. I should get around to fixing that. (Well, actually I was.. as it just considered each thing on different sides of it as different words.. so maybe I shouldn’t fix that.)

    Matt.. is this please add 9 and 4 thing a standard plugin? If it is, I predict it will be cracked and start allowing spam real soon.

    The trick with captcha isn’t just to make it harder for bots, it’s to make bots not worth the trouble. Example: Jeremy makes you type Jeremy. It’s easy to make a bot get around that on his blog, but it won’t work on other blogs.

    The trick is to have something related only to your blog.. that way it doesn’t make sense to code a bot to recognize it..

  16. No probs, Ryan.

    Maybe your validator could remove punctuation sans spaces? (i.e. a regular expression that checked for punctuation followed by a non-space character)? Then again, that scenario probably won’t play itself out.

  17. btw it would be nice to have the crawl dates published in the webmasters console (similar to Yahoo Site Explorer). My apologies if this already exists (I haven’t seen this for the site I monitor even though it’s being crawled).

  18. Great, now I go down and crawl through the kitchen to find some candy 🙂 You should do that with salad and vegetables, that is much healthier (and politically correct without any brands ;-)!

  19. This cache date change may be good for webmasters checking to see when googlebot was last there but it also seems like a good PR attempt for googlebot by showing the more recent date. I liked seeing the date for the last time Google found a new or updated page more so than seeing googlebot’s activity. This gives me more of a clue how active the website is and how old the content is. With the number of websites that don’t date their articles or content, I would often check out the cache date just for that reason. If I want to see when googlebot last visited my site I can see those visits in my stats.

    It’s not a big deal, but I think the best solution would be to show both dates.

    Thanks Matt,
    Chris

  20. Matt –

    Two observations:

    1. – You didn’t mention 410. Is Googlebot ever going to distinguish between a 410 and 404?

    2. – At 3:11 into the video, you showed your true colors. You said that Googlebot goes Slurping along. I knew you were a Y! fan at heart.

  21. The cache date always shows up for HTML documents.

    Why is there no cache date on Google’s HTML cache version of PDF and other files though?

    Would that feature be easy to add?

    I asked this about a year ago; guess it got overlooked.

  22. g1smd, my guess is that it wouldn’t be trivial to add crawl dates to PDF, but I’ll ask.

    BillyS, so many webmasters misuse 404 vs. 410 that I don’t expect we’ll distinguish between them any time soon. I didn’t even realize that I said “slurping.” 🙂

    Chris Dohman, my guess is that most people wouldn’t appreciate the difference if we slapped both on the cached page. In my mind, showing the latest date is the most accurate, because we checked on that date if the document was unchanged. If the document had changed in any way, the webserver wouldn’t have returned a 304, and we would have gotten the changed document at that time. So the date that we show is the date when we last had a valid copy of the doc, according to the webserver.

  23. This is a really unhelpful change in the cache reporting, Matt.

    I would prefer that Google include both dates. It is more informative to know when the last actual fetch was.

  24. Matt, rather than crawl dates for PDF how about making it simple and ban them from Google’s index 🙂 I HATE when I click a PDF link when I didn’t notice the HTML one!

  25. Well well well if it isn’t the yahoo basking saga again……. However good point one thing though when you can tell your self when you update your site to see when it was crawled last based on the content!

  26. I agree with both Michael Martinez and Matt on the date display. Unless Google displays both dates with correct wording, it’s confusing either way.

  27. Or use a CSS class to make them a different colour.

    You could use purple Skittle colour for PDFs (although you’d have to change the colour of visited links), brown chocolatey M&M inside for Word docs, green M&M for Excel spreadsheets, red M&M for PowerPoints, etc.

    Okay, change the colours so they don’t reflect candy…now I’m snacky. Can I use my bonus points for guessing at the 1337speak to buy Google Gum? :p

  28. But even if we checked whether a page was unchanged, we didn’t update the crawl date in the cache–we’re changing that now.

    This is a very good idea. My impression was that pages were NOT crawled since the date indicated.

  29. Candy is dandy but SEO won’t rot your teeth.

  30. we did a site:www.makemethis.com today…noticed in page two of results that two other domains were being listed with our site…

    Mailfriends, Penpals, Ecards, Postcards, Chat, Find friends – Profile
    Find Penpals & Mailfriends mail friends from all over the world for free on the largest penpals community. Including hundreds of Ecards!
    mailfriends.com/profile/dew – 16k – Supplemental Result – Cached – Similar pages

    Shipping & Returns : MakeMeThis.com
    MakeMeThis.com : Shipping & Returns. … ID Bracelets Child Medical Jewelry Medical IDs Military Dog Tags Soccer Double Hearts PB Bunny …
    http://www.makemethis.com/shipping.php – 19k – Supplemental Result – Cached – Similar pages

    Christiana Baptist Church – Company Profile | D&B Company Reports …
    Manta > Browse by Industry > Public Sector > Associations / Non – Profits > Religious Organizations : C > Christiana Baptist Church …
    http://www.manta.com/coms2/dnbcompany_gd28v – 25k – Supplemental Result – Cached – Similar pages

    The most recent date on cached pages would be great but is less important than not having other sites show up in our site…even if they are just supplemental ones…anything we can do to correct this?…It just showed up today…

  31. Matt,

    Are we seeing a data refresh on gfe-au right now?

    I noticed some ranking movement there and saw that my site went from #1 allinanchor:KWphrase to AWOL allinanchor:KWphrase, but appending &filter=0 to the query, comes back at the top of the 2nd page.

  32. I’ll bet my answer is closer than someone’s answer to the question N/0 ;).

    Matt are you folks aware of the %c0 appearing after the tld in your serps?

    It gets used to attempt location of the cache page if the cache link is clicked but not used in the link to the actual site.

    It also makes the site:mydomain.com command a bit sparse.

    Now your simple math question is please calculate the volume of a cylinder that has radius 1 yard and is 1 yard long. ;).

  33. TxRex,

    There’s a thread on WMW right now about the same thing, so I’d hope they are working on it.

    Best of luck.

  34. This is really very nice information. I have just checked my website and Google cached is showing 5 Sep 2006 16:21:04 GMT. It’s really very fast. 🙂

  35. It’s really very nice information. Because of this cache information I will know when Google has visited my website.

  36. TxRex, they’re aware of this and it’s not a huge deal, but the supplemental results team is fixing it as we speak.

    Michael Martinez, that information is always there in your server logs if you want it. Or you could remove the 304 If-Modified-Since stuff in the webserver to force Googlebot to always grab the document.

  37. Very interesting. I like your blog and I appreciate your SEO skills.

  38. Very nice explanation on that!
    But what if you’ve changed your site but the G cache still having the old version?
    Thanks for your help!

  39. If the if-modified-since logic checks only to see if a file is newer than the last time it looked at that URL, then there is a gaping logic hole to have a page of content that doesn’t match the words that it ranks for in the SERPs, and have it show up for wrong content for quite a while until Google pulls the whole page again regardless as to whether the file is tagged as changed or not.

    I already did a “proof of concept” a year ago and validated it last month again. I originally found it by accident when renaming old pages on the server and getting a few names round the wrong way.

  40. The odd thing about this is “Google is neither affiliated with the authors of this page nor responsible for its content.”
    Funny, I thought you still worked for Google, so there goes the affiliation, and since a lot of this blog is about working with Google, and what they do, good, bad and indifferent, they’re kind of responsible for the subject matter.
    Great Post Matt
    Keep em comin…

  41. g1smd etc.,

    So what you are saying is that a less than scrupulous webmaster could somehow game the system with this knowledge? Perhaps take advantage of the supplemental crawl cycle of 6 months to a year and have a page indexed for keywords that really have nothing to do with the page anymore?

  42. Hi As you Said,

    “In essence, if a page hasn’t changed since the last time we fetched it, there’s no need to fetch it again. But even if we checked whether a page was unchanged, we didn’t update the crawl date in the cache–we’re changing that now.”

    Lot of web sites use date and time controls these days for crawler’s frequent visit. Do you think this is a change for a web page???

    Cheers
    Vikas

  43. I slapped together a small tool to check the handling of IfModifiedSince queries – if anyone’s unsure how their site is handling, feel free to try it with http://oy-oy.eu/page/ifmodified/ (I also have sample asp and asp.net code for those who want to integrate it, php is up next 🙂 ).

    Matt, any clues as to why the Googlebot still gets the full page without IfModified Since from time to time? Is that to be certain that it’s not missing anything (probably)?

    Also, with IfModifiedSince – can this penalize a site that has small sections of dynamic content which might have been seen as part of the content (eg “this page has been stale for a few months, I’ll rank it lower” vs “this page has regular content-updates, it must be good”)?

  44. One more item: why is the cached date wrong sometimes? I have several sites where I dynamically store the access date/time + useragent in meta-tags (which tells me where scrapers are coming from when I see the scraped content) – sometimes the cache date is off by several days. Is that a problem which you might want to know about or is that by design? It doesn’t really bother me, but perhaps it’s important for you to know 😀

  45. Just being nit-picky: GMT went out years ago, why not use UTC instead? 😉

  46. Oh no, not another one :-): Any plans on handling If-None-Match / ETags?

  47. Keep all these videos coming 🙂 All you got to do is just put your feet on the desk and watch for a while 🙂 I like it! reading gets boring if you do it all day…

  48. Hey Matt, does this mean that since Googlebot can’t view a modified date for a dynamic page, dynamic pages would take longer to get crawled because you must actually scan the page to see if it has changed?

  49. Hi Matt,

    I’ve done “PageRank Management” on my site, i.e. I’ve done the links to the pages that I want to have high PR as static links, while the other links are Javascript (hopefully non-PR passing) links. Is that a problem? Is that against Google guidelines? My site was hammered by the last two updates, and I don’t have and never had hidden text, cloaking or anything like that.

    Is it possible that “PageRank Management” might be part of the “overoptimization” that you were talking about? Is it possible that my site was hammered because of that?

    Thanks!

  50. Hi Matt, Have you checked your new wardrobe ? I like the underwear…

    http://www.cartoondollemporium.com/mattcutts.html#27

  51. * * * Perhaps take advantage of the supplemental crawl cycle of 6 months to a year and have a page indexed for keywords that really have nothing to do with the page anymore? * * *

    People have already done that, but supplemental pages often don’t rank well for popular terms, so not a lot to be gained there I think..

    No, my observation was about having a generic page about an annual event that ranked well for several terms, and had been indexed for a long time, and then one day hiding that version of the page by renaming it and then uploading a new page at the old URL with specific information about just that year’s event.

    After having that page on the server for a month, and it having been indexed several times, it was deleted, and the generic page renamed back. So now the old page, with the old content, with an older date than had just been there, was now live.

    It took Google a long time to see that change. Google accessed the site many times, and as the date was not newer, it assumed that the page was unchanged. They seemingly had no concept that a page could get older. But it can.

  52. “Michael Martinez, that information is always there in your server logs if you want it. Or you could remove the 304 If-Modified-Since stuff in the webserver to force Googlebot to always grab the document.”

    Matt, you’re assuming I would only want to know when Googlebot came to fetch data from my own sites.

    I very often look at Google’s information on Web sites for other people (many of whom don’t have access to server logs, don’t know how to read server logs, and sometimes don’t have much control over the content of their own pages).

    Since you’re no longer reporting when page data was actually fetched, you’ve pretty much tied my hands and the hands of many other people who try to help others — not to mention those Webmasters who don’t have access to their server logs, who don’t know how to change server configurations (and many of whom cannot change their server configurations because of webhosting restrictions).

    You really need to stop and think about this change from the Webmaster’s point of view. That cache date is, for many people, the only indication of what may be going on. People can upload pages, lose those changes temporarily, and not be aware that Googlebot has come and gone on a rare visit until they see the LAST FETCH DATE in the Google cache.

    Servers go down, backups are restored over new files, backups are restored over replacement servers, etc.

    There is no real value for any Webmaster in knowing that Googlebot came by yesterday, got a code 304 response, and moved on. There is value in knowing when the page was last fetched.

    So, the compromise that will serve everyone’s best interests is to report both dates.

  53. Why can’t we see Google Video in China?
    Every time I want to watch the video, it’s in vain. Can Matt Cutts also post them on other sites?

  54. Hi Matt

    Some sites are created using frames, and when you click to see the cache, the information is hidden and one cannot access it. Is there any negative impact in Google from this?

  55. Thanks for the info!

  56. Why does Google crawl and cache some sites every day but other sites only every week or month?

  57. We have been experimenting with building a crawler ourselves this year, I reckon everyone in the SEO sphere should do it for the experience. I now have a LOT more empathy for what your crawler team must go through on a daily basis and the sort of issues you face and why…darn people do some dumb stuff into their URLs and server configs.

    1337 – h0// c0//3 1 n3v3r 590773d 7h15 b4?
    http://www.google.com/intl/xx-hacker/
    v.7unny

  58. Since the middle of August I have had about 100,000 pages in the Google index (finally). However every 6 – 8 days the number of pages in the index drops to about 100 – 350 pages. It stays that way for about 12 – 18 hours and then bounces back up to almost 200,000 pages and then 12 – 18 hours more, drops back to around 100,000 pages and stays consistent.

    What the heck is causing this?

    Also, my server doesn’t have a sound card and my laptop’s hard drive sounds like someone took a hammer to it (grinds horribly) so I wasn’t able to watch your video yet. Can you tell me if the date I have on every page of my site (http://www.lyricvault.com) will cause any issues regarding caching of my pages?

    Last thing, I have a lyrics page for every song for sale in the U.S. (depending on when I last updated the db, do it about every two weeks or so) however only 300,000 or so actually have the lyrics on them (the purpose of the site). I am noticing that a lot of people that land on a lyric page that has no lyrics will submit them (I try to make it real easy for them) however it will take weeks or months for Google to update their cache of the new content on the page (i.e. the added song lyrics). Is there something I can do to tell Google with a big red neon sign that the page has significantly changed and the change is paramount to the purpose of the site so please update the cache on the sucker?

    Thanks a million,

    Brent

  59. Now you’ve gone and made me hungry…

  60. Luis Filipe Fabiani

    Matt, check out number 7 at http://www.ericward.com/linkmoses/ten.html

  61. Sorry to chime in so late with this but I thought I’d share it. The old method of showing the date allowed me to use the cache as a kind of “has this page recently changed?” tool. I can’t do that anymore. No big deal, but there ya go.

  62. Hi Matt,

    In your efforts to save bandwidth/trees/whales/etc, do you also make use of the Content-MD5 (and Content-Length) header to avoid fetching mirrored data? I happen to provide this for my (often large) multimedia exhibits and it would be nice if G could realise/guess that with the same URL suffix, length, last-mod date and domain that it does not need to fetch the mirrored content at all. This would mean you could still direct local users to their local mirrored copy but save us all lots of bandwidth/trees/karma/etc. (Obviously you’d want to verify the content on a random sample to avoid black-hattery…)

    Rgds

    Damon

  63. Hi Matt, why is my site sometimes 254 pages, sometimes 165, sometimes 132 (using site:www.mysite.tld)? In the Google cache, when it is low the pages are missing. When it is high the pages are in the cache. How many caches are there anyway? This has been happening for 6 weeks and the changes seem to happen overnight with no rhyme or reason.

    Also the cache date for pages is sometimes back in May. But other days the same page has a cache date in August or September. How does this relate to the image you have with the cache date circled in red?

  64. My page has been updated about 4 times in the past few months and it seems like Google has not crawled it for about 9 months. So the cached version of my site now is extremely outdated. I had assumed that it would pick up the latest change I made, and the less recent change would show up in the cached version. However, this does not happen, and a really, really outdated version of my page shows up in the cached version. Why is this so?

  65. Hello Matt,

    I have another question: how can I change the cache of a page?

    If I modify the content of a page, and Google crawls the page the same day, when will the cache change?

  66. Hey Matt,

    Something small: could you please categorise this post as ‘Movies/Videos’? 😀

    I’ve recently started reading your blog and, well, keep up the good work! I particularly enjoyed the SEO issues you talked about in those videos you recorded back in 2006. Though, for some reason, when downloading clips from Google Video, the chances of me getting a file without it stopping at some point are about 50/50. Strange.

    Brian

  67. Thanks Matt, for the Vanessa Fox reference post link.

  68. Just came across this on a Google search in trying to change the dates of Google’s cache of a particular web page. It looks like the video is no longer working, would you be able to link/embed it?

    Also, would it be the same as this video:
    http://www.youtube.com/watch?v=8lmZS7TknQc
