If you’re a search engine geek, you’ll enjoy this study of search engine freshness over a six-week period starting in February, 2005: http://eprints.rclis.org/archive/00004619/01/JIS_preprint.pdf (PDF). Found via the excellent SEW blog, which noticed it on PhilB’s blog.
I’ll Cliff-note it out for you. The authors tracked Google, Yahoo, and MSN over 42 days using 38 German webpages that were updated daily and that included a datestamp somewhere on the page. They measured freshness by looking at each search engine’s cached page to see how up-to-date the page was. If you measure success by having a version of a page that is 0 or 1 days old, Google succeeded a little under 83% of the time, MSN succeeded 48% of the time, and Yahoo succeeded about 42% of the time. They also measured the average age of pages. On average, pages from their sample were 3.1 days old in Google, 3.5 days old in MSN, and 9.8 days old in Yahoo. They reach the conclusion
[Yahoo] updates its index in a rather chaotic way, so one can neither be sure how old the index actually is nor if a large proportion of the pages in the index was updated recently.
which seems a little harsh to me; I think Yahoo had an older data center in the mix. But I have caught myself wishing that Yahoo would show the crawl-date on its cached pages (both Google and MSN show the crawl-date if you click on the cached page).
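If you want to see how the two numbers above (the share of pages that are 0 or 1 days old, and the average age) fall out of the raw observations, here is a minimal Python sketch. The function name `freshness_stats`, the input layout, and the sample dates are all my own assumptions for illustration; none of this is the authors' code or data.

```python
from datetime import date


def freshness_stats(observations, max_lag_days=1):
    """Summarize freshness for a list of (checked_on, cached_datestamp) pairs.

    checked_on is the day the cached copy was looked at, and
    cached_datestamp is the date shown on the cached page itself.
    Returns (share of pages at most max_lag_days old, mean age in days).
    """
    ages = [(checked - cached).days for checked, cached in observations]
    fresh_share = sum(1 for age in ages if age <= max_lag_days) / len(ages)
    mean_age = sum(ages) / len(ages)
    return fresh_share, mean_age


# Made-up observations for illustration only (not numbers from the study).
sample = [
    (date(2005, 2, 10), date(2005, 2, 10)),  # same day: 0 days old
    (date(2005, 2, 10), date(2005, 2, 9)),   # 1 day old
    (date(2005, 2, 10), date(2005, 2, 5)),   # 5 days old
    (date(2005, 2, 10), date(2005, 1, 31)),  # 10 days old
]

share, mean_age = freshness_stats(sample)
print(share, mean_age)
```

The made-up sample also shows why the study reports both numbers: a single stale page barely moves the success rate but pulls the average age up a lot.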
The authors had a few questions that I think I can clear up. First, they noticed that Google can have pages that are more than a month old in its index. I suspect that those results were Supplemental Results. Supplemental Results can lag behind our regular index. The authors also noticed that Yahoo’s dates sometimes alternated between fresher and older versions of a page. I’m guessing that’s because Yahoo had two different data centers, one with older data, and sometimes the query hit the data center with the older data.
Finally, they note that Greg Notess studied this a while ago and they weren’t sure how Greg assessed the age of a page. After all, Greg was analyzing freshness years ago, and back then only Google offered cached pages. How did Greg know when a page was crawled? The answer is (or at least my hunch is) that Greg did something really smart: he found a bunch of pages where the title of the page included the current date. For example, this page always has the current date and time in the title. So by doing searches that returned those pages, he could tell when a search engine had crawled that page. Of course, now most major search engines let you access a snapshot of the page that they crawled, which allows for a better sample.
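To make that hunch concrete, here is a rough Python sketch of the date-in-the-title trick. It assumes the title carries the date in `YYYY-MM-DD` form, which is purely an assumption on my part; a real page could use any format, and the function names `crawl_date_from_title` and `index_age_days` are hypothetical. I don't know how Greg actually did it.

```python
import re
from datetime import date

# Assumed title format: an ISO-style date such as "2005-02-07" somewhere in the title.
TITLE_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def crawl_date_from_title(title):
    """Return the date embedded in a result title, or None if there is none."""
    match = TITLE_DATE.search(title)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    return date(year, month, day)


def index_age_days(title, searched_on):
    """Days between the date in the indexed title and the day of the search."""
    crawled = crawl_date_from_title(title)
    if crawled is None:
        return None
    return (searched_on - crawled).days


# Hypothetical title as a search engine might show it in its results.
print(index_age_days("Current time: 2005-02-07 14:30", date(2005, 2, 10)))
```

The page puts its current date in the title, so the title a search engine shows is frozen at crawl time, and the gap to the day you run the search is the age of that engine's copy.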
I’m leaving out a lot of interesting bits of the study, because you should really go and read all 29 pages yourself. Go on. Do it. You call yourself a search engine geek and you can’t spare an hour to read about freshness? Do it.