Measuring freshness

If you’re a search engine geek, you’ll enjoy this study of search engine freshness over a six-week period starting in February, 2005: http://eprints.rclis.org/archive/00004619/01/JIS_preprint.pdf (PDF). Found via the excellent SEW blog, which noticed it on PhilB’s blog.

I’ll Cliff-note it out for you. The authors tracked Google, Yahoo, and MSN over 42 days using 38 German webpages that were updated daily and that included a datestamp somewhere on the page. They measured freshness by looking at each search engine’s cached page to see how up-to-date the page was. If you measure success by having a version of a page within 0 or 1 days, Google succeeded a little under 83% of the time, MSN succeeded 48% of the time, and Yahoo succeeded about 42% of the time. They also measured the average age of pages. On average, pages from their sample were 3.1 days old in Google, 3.5 days old in MSN, and 9.8 days old in Yahoo. They reach the conclusion

[Yahoo] updates its index in a rather chaotic way, so one can neither be sure how old the index actually is nor if a large proportion of the pages in the index was updated recently.

which seems a little harsh to me–I think Yahoo had an older data center in the mix. But I have caught myself wishing that Yahoo would show the crawl-date on its cached pages (both Google and MSN show the crawl-date if you click on the cached page).
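For the curious, the two measures quoted above (the share of observations where the cached copy is at most one day old, and the average age of the cached copies) can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `freshness_summary`, the `max_fresh_age` parameter, and the sample ages are all made up for the example.

```python
from statistics import mean


def freshness_summary(ages_in_days, max_fresh_age=1):
    """Return (share of observations with age <= max_fresh_age, mean age).

    ages_in_days holds one entry per (page, day) observation: how many days
    old the cached copy was when it was checked.
    """
    if not ages_in_days:
        raise ValueError("need at least one observation")
    fresh = sum(1 for age in ages_in_days if age <= max_fresh_age)
    return fresh / len(ages_in_days), mean(ages_in_days)


# Made-up ages (in days) for illustration only; this is not data from the study.
sample_ages = [0, 1, 0, 3, 12, 1, 0, 2]
share, avg = freshness_summary(sample_ages)
print(f"fresh share: {share:.1%}, average age: {avg:.2f} days")
```

Note how a single very stale copy (the 12 in the sample) pulls the average up while barely moving the share of fresh copies, which is why it helps to look at both numbers.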

The authors had a few questions that I think I can clear up. First, they noticed that Google can have pages that are more than a month old in its index. I suspect that those results were Supplemental Results. Supplemental Results can lag behind our regular index. The authors also noticed that Yahoo’s dates sometimes alternated between fresher and older versions of a page. I’m guessing that that’s because Yahoo had two different data centers, one with older data, and sometimes the query hit the data center with older data.

Finally, they note that Greg Notess studied this a while ago and they weren’t sure how Greg assessed the age of a page. After all, Greg was analyzing freshness years ago, and back then only Google offered cached pages. How did Greg know when a page was crawled? The answer is (or at least my hunch is) that Greg did something really smart: he found a bunch of pages where the title of the page included the current date. For example, this page always has the current date and time in the title. So by doing searches that returned those pages, he could tell when a search engine had crawled that page. Of course, now most major search engines let you access a snapshot of the page that they crawled, which allows for a better sample.
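The datestamp trick is easy to sketch. The Python below is a rough illustration of the idea, not anything Greg or the study authors actually ran: it assumes the page title (or cached text) contains a date in YYYY-MM-DD form, which real pages often won't, and the names `DATE_PATTERN` and `crawl_age_in_days` are invented for the example.

```python
import re
from datetime import date

# Assumption: the datestamp is written as YYYY-MM-DD. Real pages use many
# other formats, so this pattern would need adapting per page.
DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def crawl_age_in_days(cached_text, observed_on):
    """Days between the first datestamp in cached_text and observed_on.

    Returns None if no valid datestamp is found.
    """
    match = DATE_PATTERN.search(cached_text)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        stamped = date(year, month, day)
    except ValueError:
        # Digits matched the pattern but do not form a real calendar date.
        return None
    return (observed_on - stamped).days


# Hypothetical title as a search engine might show it for a page crawled on 2005-02-14.
age = crawl_age_in_days("Daily news for 2005-02-14", date(2005, 2, 17))
print(age)
```

Because the page changes its datestamp every day, whatever date the search engine shows is the day it fetched the page, so the difference from today's date is the age of the copy.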

I’m leaving out a lot of interesting bits of the study, because you should really go and read all 29 pages yourself. Go on. Do it. You call yourself a search engine geek and you can’t spare an hour to read about freshness? Do it.

27 Responses to Measuring freshness

  1. Hi Matt

    Great post!

    “Google can have pages that are more than a month old in its index. I suspect that those results were Supplemental Results. Supplemental results can lag behind our regular index.”

    And that leads me to ask you to elaborate more about “supplemental results”. Are they always duplicates of already existing contents? and how do you choose which version to show as supplemental result?
    Is the document history the most deciding factor in this connection?

    Thanks.

  2. Thanks Matt… for sharing links with us … yes of course we will spare time to read ALL about search engines ..;)

  3. Great success for Google (again), but there’s something about the Supplemental Results I don’t understand. I see a lot of results in there, which are “very” old, up to 18 months. I also see a lot of pages which have been and are still delivering a 404 for a long period now. Maybe you could clarify this a bit some day.

  4. Hi Matt, Indeed, I noticed the problem of freshness myself, as far as directory-searches are concerned. Google succeeds quite well to rank the newer pages higher than the older ones, whereas the other search engines don’t.

    Matt, I became quite an expert on the problems of search engines, and the presentation of results. For example, did you know that one gets to have more fresh results if he or she phrases the search-phrase itself in a better way?? You can try it yourself: try to put in only one or two unrelated words, with no context-related word into the search phrase, and you would see that you get old results even on Google. Then, add to the same search-phrase any context-related word — and you would see that you get more fresh results. This means that the search-engine, and its freshness, are “search-phrase-sensitive.”
    Irit.

  5. Matt,

    Can you explain Google’s supplemental index? Why exactly does Google use it, and when does it fall back to using it in queries?

    For some (somewhat esoteric) queries I quite often end up in the supplemental index, with results coming from (for instance) mailing list archives.

  6. Hello everyone,

    A page displaying a picture of Matt’s cat need not be fresh when it might never, ever substantially change. A dynamic site, on the other hand, with material changes every day, should get refreshed often. That is a vital factor for measuring engines against each other.

    -d

  7. Hi Matt,

    The study seems to confirm the thoughts around the SEO world.
    One idea pointed out by the authors I’d like you to address is the theory about Google counting clicks.
    The rumor is becoming more and more persistent that Google, Yahoo and MSN use their toolbars (checksum for the Google Toolbar) to count clicks made on links that appear on a web page. Therefore, a backlink would be more pertinent if it is clicked on many times.
    Considering that the PageRank is not involved so much in ranking, I suppose you have to find new ways to find if a page is popular or not. Thus the clicks could be a very clever way to estimate popularity.
    I hope that my explanation is clear :-S
    Please tell us more on this topic.
    Thanks

  8. Thanks for the link! The paper was interesting.

    I find that my freshness problems are usually the reverse of what people might generally assume as the problems one would have. When I notice freshness issues it is because I search for something, find a link that mentions what I want, and click on the link to find the content is not there (generally this relates to mailing lists or message boards about various computer problems).

    It would be interesting to study how often people want the stale information. Internally, a search engine could study this by keeping track of how often users go to a site and then go to the cached version of the same site.

  9. I agree: 2- New Domains are initially in the worst possible position to compete
    (as opposed to an established site simply adding another page)

    http://www.referatde.com

  10. The theory of counting clicks is interesting, although I’m sure that if it becomes known there will be those who find a way to manipulate it. I myself never realized that Google beat out the other engines on freshness, especially by such a margin. For my sites, MSN has always been the first to update their cache. Guess I’m an exception to the rule.

    When searching for files I also tend to use the other engines, mainly because I often find what I’m looking for at newer sites, and these tend to be sandboxed by Google…

  11. Well that didn’t work out very well did it? LOL darned caret wiped out everything. Continuing…

    Oddly, Yahoo do provide something sort of like a last crawl date via their API. They call it ModificationDate.

    A friend was asking how to tell the last crawl date a couple of months ago so I whipped up a little Yahoo API script to help them out. The thing is that ModificationDate doesn’t really show the last crawl date apparently. It looks like Slurp is grabbing the file date from the server during its crawl, and that’s the date you get via the API.

    So my little script still works to find the last crawl just fine for dynamic sites, since by their nature they’re going to report the current date. Doesn’t work at all for static sites though. Too bad on that count.

    Give a yell if you want to play with the script Matt. I’ve already GPL’d it and released it.

  12. If you search Yahoo’s index via Position Tech the last crawled dates are displayed.

    http://search.positiontech.com/Yahoo/PositionTechSearch.jsp

  13. Lopaka posted: If you search Yahoo’s index via Position Tech the last crawled dates are displayed.

    PT’s using the same ModificationDate I mentioned above. Best I can tell it’s not the Last Crawled date. It’s the date the file was last modified as reported by the server.

    Works out to be Last Crawled for dynamic sites, but not necessarily for static sites.

  14. I as well was wondering if you could explain what supplemental results are.

    Thanks.

  15. Matt,

    Speaking of Freshness, how about search parameters we can use in google to find:

    1. How old a link relationship is (age of a link) and
    2. Websites NOT updated in the last X months or years (the inverse of “restrict your results to the past three, six, or twelve month periods”)

    That would be super helpful.

    Thanks!

  16. The theory of counting clicks is interesting, although I’m sure that if it becomes known there will be those who find a way to manipulate it. I myself never realized that Google beat out the other engines on freshness, especially by such a margin. For my sites, MSN has always been the first to update their cache. Guess I’m an exception to the rule.
    http://www.referatonline.com
    http://www.referatonline.com/referate/Geographie/29/Geographie2.php
    When searching for files I also tend to use the other engines, mainly because I often find what I’m looking for at newer sites, and these tend to be sandboxed by Google…
    http://www.referatonline.com/referate/Geographie/25/Geographie2.php
    http://www.referatonline.com/referate/Geographie/27/Geographie2.php

    Anymau

  17. A quick note to say that the cited link did not work for me; I did have success if it is all lower case:

    http://eprints.rclis.org/archive/00004619/01/jis_preprint.pdf

    Your postings are very informative. Thank you!

    – Sean

  18. Hi Matt,

    thank you for your great Blog. A friend of mine told me about it and the information is really great.

    Great Site and greetings from Germany,
    Thorsten

  19. Hi Matt,

    Thank you for a very informative blog!

    Quick question. Can you tell us how to get out of the supplemental results index? We have recently re-launched our site and would like to get back on the SERPs.

    Please advise.

    Webmaster
    Discount Hotels

  20. Ed, I am not Matt but I think the only way to get out of supplemental results is to read the Google guidelines as the priest reads his bible, over and over again. You should be strictly using only natural link building and white hat SEO.

  21. I agree that google guidelines are not a bible. They change their code every 3 months anyway.

  22. Good tips on adding current date on the site.

  23. I really enjoyed the post, but I’ll try to add some tips on how you can avoid being a site that’s a Google “supplemental result”.

    1) Be careful what you quote from other sites. Don’t quote half an article or Google will push you down!
    2) Keep your title to a maximum of 60 characters. (In 60 chars…)

    Hope it’ll help.

  24. Hey Matt,
    your Blog is very great…

    The best Seo Blog in the Web with good Information`s.

    Greats from Germany

  25. This page isn’t so fresh, though to be fair, the time stamp says as much. The link to the study is broken anyways.

  26. Matt, I was looking for info on your blog about (keywords:) the ‘supplemental index’ …

  27. Dear Matt, thanks for commenting on my study. Maybe you are interested in the follow-up study covering three years (2005-2007). It can be found on my homepage (http://www.durchdenken.de/lewandowski/doc/JIS2008_preprint.pdf) and will be published in the Journal of Information Science within the next months.
    Regards, Dirk.
    PS: None of the pages reported on was found in Google’s supplemental index.
