Archives for September 2005

Bacon polenta

(another post-without-particular-polishing)

You have to leave room in your life for serendipity sometimes. For example, without an accident, I never would have discovered the joy of bacon polenta. (All the people who say “Matt, I thought you were going low-carb?” can step off. I’m playing hockey today. I’ll skate it off.) The Google cafe may like to call it creamy polenta, but as a Southern boy, I just think of it as cheese grits. It’s even better with bacon. Mmmm. Bacon-y goodness.

Accidents in crawling/indexing/scoring happen too. Sometimes they’re happy: “If we turn this factor off, scoring gets better? Cool!” Sometimes they’re unhappy: “What happened to this page?” One of my least favorite accidents is when someone reports a 301 or 302 problem. The heuristics we put into place have greatly reduced complaints about “302 hijacking.” For the first time in about a month, I got an email about a “302 hijacking”. This case was especially interesting because I got email from both sides: someone from the destination site wrote, and the source site also wrote to say “we didn’t mean for this to happen.” I take that as a kinda good sign; when I hear about it from both ends, 302 problems are hopefully much rarer. I passed the info on to the mailing list we have for that, and I’ve asked a colleague to email both sides when we get it debugged.

What do you do if you suspect a “302 hijacking” but don’t have my email address? There’s a convenient way that should get your report to the same engineering list, where it will get the same level of investigation. Go to http://www.google.com/support/bin/request.py and click “I’m a webmaster inquiring about my website”, then select “Why my site disappeared from the search results or dropped in ranking” and click continue. In the webform that you get to, make sure you put “canonicalpage” in the Subject line, then put the details in the Message body. Someone will route that message to an engineering mailing list where we dissect claims of canonicalization problems (that is, picking the wrong URL).

I also got one email today about a site being indexed under both www.domain.com and domain.com. The proper procedure (assuming that you want www.domain.com to show up) is to make domain.com do a permanent (301) redirect to www.domain.com. The person who wrote said that we hadn’t crawled domain.com recently enough to find the 301/permanent redirect. I’d be curious to hear feedback (via the same channel as in the paragraph above) to see how many other people are running into this issue.
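In case the mechanics of that fix aren’t obvious, here’s a minimal sketch of what the server-side decision boils down to. Everything here is my own illustration (example.com, the function name, plain-HTTP URLs), not anything Google-specific:

```python
from typing import Optional

def canonical_location(host: str, path: str,
                       preferred: str = "www.example.com") -> Optional[str]:
    """Return the Location header value for a 301 redirect if the
    request arrived on a non-preferred host, or None if the request
    is already on the canonical host."""
    if host.lower() == preferred:
        return None  # already on www.example.com; serve the page normally
    return "http://%s%s" % (preferred, path)

# A request for http://example.com/about should 301 to the www version:
print(canonical_location("example.com", "/about"))      # → http://www.example.com/about
# A request already on www needs no redirect:
print(canonical_location("www.example.com", "/about"))  # → None
```

In practice you wouldn’t write application code for this at all; on Apache, for instance, this is usually a RewriteRule in .htaccess that issues the 301 before any page logic runs.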

Measuring freshness

If you’re a search engine geek, you’ll enjoy this study of search engine freshness over a six-week period starting in February 2005: http://eprints.rclis.org/archive/00004619/01/JIS_preprint.pdf (PDF). Found via the excellent SEW blog, which noticed it on PhilB’s blog.

I’ll Cliff-note it out for you. The authors tracked Google, Yahoo, and MSN over 42 days using 38 German webpages that were updated daily and that included a datestamp somewhere on the page. They measured freshness by looking at each search engine’s cached page to see how up-to-date the page was. If you measure success by having a version of a page within 0 or 1 days, Google succeeded a little under 83% of the time, MSN succeeded 48% of the time, and Yahoo succeeded about 42% of the time. They also measured the average age of pages. On average, pages from their sample were 3.1 days old in Google, 3.5 days old in MSN, and 9.8 days old in Yahoo. They reach the conclusion

[Yahoo] updates its index in a rather chaotic way, so one can neither be sure how old the index actually is nor if a large proportion of the pages in the index was updated recently.

which seems a little harsh to me–I think Yahoo had an older data center in the mix. But I have caught myself wishing that Yahoo would show the crawl-date on its cached pages (both Google and MSN show the crawl-date if you click on the cached page).

The authors had a few questions that I think I can clear up. First, they noticed that Google can have pages that are more than a month old in its index. I suspect that those results were Supplemental Results. Supplemental results can lag behind our regular index. The authors also noticed that Yahoo’s dates sometimes alternated between fresher and older versions of a page. I’m guessing that that’s because Yahoo had two different data centers, one with older data, and sometimes the query hit the data center with older data.

Finally, they note that Greg Notess studied this a while ago and they weren’t sure how Greg assessed the age of a page. After all, Greg was analyzing freshness years ago, and back then only Google offered cached pages. How did Greg know when a page was crawled? The answer is (or at least my hunch is) that Greg did something really smart: he found a bunch of pages where the title of the page included the current date. For example, this page always has the current date and time in the title. So by doing searches that returned those pages, he could tell when a search engine had crawled that page. Of course, now most major search engines let you access a snapshot of the page that they crawled, which allows for a better sample.
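That trick is easy to picture in code. Here’s a hedged sketch of the idea — the title format, function name, and sample strings are all my own invention, purely to illustrate: if a result’s title carries a datestamp, the gap between that date and today tells you how stale the engine’s copy is.

```python
import re
from datetime import date

# Map English month names to month numbers for parsing titles.
MONTHS = {name: i for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

def crawl_age_days(indexed_title, today):
    """Estimate index staleness from a datestamped title such as
    'Daily report - September 21, 2005'. Returns age in days, or
    None if no recognizable date appears in the title."""
    m = re.search(r"(%s) (\d{1,2}), (\d{4})" % "|".join(MONTHS), indexed_title)
    if not m:
        return None
    crawled = date(int(m.group(3)), MONTHS[m.group(1)], int(m.group(2)))
    return (today - crawled).days

print(crawl_age_days("Daily report - September 21, 2005", date(2005, 9, 24)))  # → 3
```

The clever part of the original trick wasn’t the parsing, of course; it was realizing that pages with the current date in their title act as free timestamps inside any engine’s index, no cached-page feature required.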

I’m leaving out a lot of interesting bits of the study, because you should really go and read all 29 pages yourself. Go on. Do it. You call yourself a search engine geek and you can’t spare an hour to read about freshness? Do it.

One month

My site has now been live for about a month–I launched it during SES San Jose (I wrote a few test posts before that to get the hang of WordPress, but the site wasn’t visible until a month ago). I’ve learned that making new content is time-consuming. And I’ve got a better appreciation for the things that a site owner faces, like a server going down. It’s been a blast though. I’m delighted to see that over 440 people are subscribed to me on Bloglines. Jeremy Zawodny has about 4215 subscribers across his feeds. So I hit over 100 milli-Zawodnys in just a month–woohoo! 🙂
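For anyone who wants to check my milli-Zawodny arithmetic (using the subscriber counts above):

```python
# Subscriber counts from the post above.
subscribers = 440
zawodny_subscribers = 4215

# One milli-Zawodny = one thousandth of Jeremy's subscriber count.
milli_zawodnys = subscribers / zawodny_subscribers * 1000
print(round(milli_zawodnys, 1))  # → 104.4, comfortably over 100
```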

I have a few topics I plan to talk about, but let me know what you’d like to see. More gadget posts? (“Matt: How do I install SlimServer on a Buffalo LinkStation?”) More SEO posts? (“Matt: what do you look for in a reinclusion request?”) More reviews? (“Matt: what did you think of Mike Moran’s and Bill Hunt’s new book about Search Engine Marketing? Either that, or the latest Harry Potter book. Your choice.”). More bloggy posts? (“Matt: where’s the pictures of your cat?”) Lemme know what you’d like to see.

Book Review: The Search

If you live in Silicon Valley and you’re creative, you can already find a copy of The Search, John Battelle’s chronicle of Google and the search industry. So far, I’ve uncovered one major problem: I couldn’t put it down. I remember going on a weekly walk with some other Googlers several years back and wondering why there were no books about Google. eBay? The Perfect Store. Yahoo? Inside Yahoo: Reinvention and the Road Ahead. Amazon? 21 Dog Years: Doing Time @ Amazon.com and more recently, amazonia. Microsoft? I’ve got a whole shelf full of books about Microsoft. Google? Back then, nada. Zip. Zilch. Why, I wondered with my walking companions, hadn’t anyone written a book about us? Didn’t they know we’ve got good stories?

Now the situation is much better, from Tara and Rael’s Google Hacks to Chris Sherman’s excellent Google Power, which deserves a review in its own right. But The Search comes as close as anyone outside Google has come to getting into the Google story. Could I pick a few nits? Sure. Battelle mentions that Yahoo switched from Inktomi to Google in June 2000, but that’s when the deal was announced–the actual switchover happened over the long holiday weekend for July 4th. And would I describe a few parts of the story differently? Absolutely. But overall, it’s a fascinating snapshot of the search industry and Google in particular. I can pretty much guarantee that you’ll learn a few things you didn’t know. For example, if you want to find out the real story of where “Don’t be evil” came from, you’ll find it on page 138.

Some parts that I particularly enjoyed:

  • The book has a great layman’s guide to how a search engine crawls, indexes, and serves up the web. You’ll also hear about several interesting issues in passing, such as why the search [York] probably shouldn’t return results dominated by New York. You’ll find the definition of a SERP (search engine results page). If you’re an SEO and none of your family knows what you do, this book would make a nice gift.
  • The pre-Google history gives wonderful background. If you’re an SEO geek, you still might not know about TREC or the story of AltaVista at DEC. It’s great stuff. I really enjoyed reading the section about GoTo/Overture.
  • There are several fun factoids: Andrei Broder reported that in 2001, 12% of queries to AltaVista were sexual. The book also quotes IBM’s WebFountain team as saying that 30% of the web is porn.

There are also some things to file away and mull over. Battelle has a really interesting take on Yahoo and Google and their different approaches. It’s quite nuanced and you’ll want to read the book to get it all, but this quote was especially interesting:

Yahoo is far more willing to have overt editorial and commercial agendas, and to let humans intervene in search results so as to create media that supports those agendas…. Google sees the problem as one that can be solved mainly through technology–clever algorithms and sheer computational horsepower will prevail. Humans enter the search picture only when algorithms fail–and then only grudgingly.

A couple years ago I might have agreed with that, but now I think Google is more open to approaches that are scalable and robust if they make our results more relevant. Maybe I’ll talk about that in a future post.

Reading The Search gives you a jumpstart on understanding the search industry. If you’re an SEO, you need to pick up a copy of this book. At the same time, there are so many interesting stories left to be told about search. I wanted to hear stories about Inktomi and how Inktomi employees would eat the Habanero Hamburger (yes, I have had one). Or about Yahoo’s H=1 parameter. Or about www2.google.com and www3.google.com. But a book can only tackle so much, and The Search is a great summary of search so far. Highly recommended.

What’s an update?

(Normally I, you know, think before I post. I’m experimenting with the quick-post-with-very-little-thought technique here.)

What is an update? Google updates its index data, including backlinks and PageRank, continually and continuously. We only export new backlinks, PageRank, or directory data every three months or so though. (We started doing that last year when too many SEOs were suffering from “B.O.”, short for backlink obsession.) By the time new backlinks/PageRank appear, we factored that data into our rankings quite a while ago. So new backlinks/PageRank are fun to see, but that’s not an update; it’s just already-factored-in data being exported visibly for the first time in a while.

Google also crawls and updates its index every day, so different or more index data usually isn’t an update either. The term “everflux” is often used to describe the constant state of low-level changes as we crawl the web and rankings consequently change to a minor degree. That’s normal, and that’s not an update.

Usually, what registers as an update with the webmaster community is when we update an algorithm (or its data), change our scoring, or switch over to a new piece of infrastructure. Technically Update Gilligan is just backlink/PageRank data becoming visible once more, not a real update. There haven’t been any substantial algorithmic changes in our scoring in the last few days. I’m happy to try to give weather reports when we do update our scoring/algo data, though.

Um, that’s all I can think of regarding taxonomies of updates, so I guess I’ll publish it. 🙂
