When I joined Google in early 2000, we had a stretch where we didn’t update our index for 3-4 months or more. At the time, that wasn’t bad for a search engine; I remember one search engine around then that wasn’t updated for over a year. Starting in mid-2000, Google updated our index pretty much every month. People used to use the phase of the moon to predict the timing of the next “Google Dance.”
Now raise your hand if you remember “Update Fritz” from summer 2003. That was the Google Dance where Google switched from a monthly batch update to an incremental update. That means that our crawl/indexing team updated a fraction of our index daily or near-daily. Back then we had not only the normal crawl but also a “fresh crawl,” and if documents were in the fresh crawl then Google would sometimes show a date in our snippet.
The Google crawl/indexing team has continued working hard, and several people have noticed Google’s index getting fresher and fresher. Now some documents can show up in minutes instead of hours or days.
I’ve noticed that as search engines have gotten better (fresher, bigger, more relevant), people keep adjusting their expectations upwards. I can’t imagine waiting over a month for search engines to update their index with news events any more, but just a few years ago that’s how things worked. And it only takes a few encounters with a fresh index until you ratchet up your expectations. My previous mental model was “normally it takes a day or so to show up in many search engines,” but I had my own “Zoiks! That’s fast!” experience tonight, which I’ll describe for you.
I was feed-grazing in Google Reader, as I am wont to do, when I saw a message that there was an update to Reader’s code for offline reading (Google Gears). In my experience, if I move on to the next feed, I lose that little message with the link to update the code (not sure why, but that’s a different post). So I click the link and update my code for offline reading.
In the process, I lost the post that I was currently reading, which was Rich Skrenta’s post about Persai. I wasn’t done reading the post, so what do I do? I go to Google and search for [skrenta blog] so that I can find Skrenta’s blog and finish reading the post.
And what did I see in my search results? The snippet from Skrenta’s blog was showing the post that he did at 7:54 p.m. Pacific time. It was about 8:44 p.m. Pacific time when I did the search. So the time from Rich hitting the “Post” button to me being able to see it in Google’s main search index was about 50 minutes, under an hour.
Don’t believe me? Here’s the bottom of Rich’s post, showing that it went live at 7:54 p.m.:
I double-checked that Rich’s blog was on Pacific time by leaving a quick comment on one of Rich’s other posts.
And here’s Google showing a snippet from Rich’s post within an hour after it went live:
Now that’s a minty fresh index. It takes a lot of good design and infrastructure to be able to refresh large numbers of pages that fast. Congrats to the Googlers who are improving Google’s ability to re-crawl, index, and score web documents quickly.
Update: I was only checking every 10 minutes or so, but this post was crawled/indexed/searchable in half an hour or less: