Archives for July 2006

A word about metrics, part II

Okay, in a previous post I told a story about Google’s market share in the early days, and mentioned that you have to think about the limitations of any measuring methodology. I briefly touched on sampling bias too. Let’s consider sampling bias in a different arena: Alexa.

One possible source of skewing in Alexa data is a bias toward webmaster-y sites. Alexa shows how popular web sites are, so it’s natural that webmasters install the Alexa toolbar. Some do it just so that their normal day-to-day visits around the web (including their own site) are added to Alexa’s stats. The net effect is that webmaster-related sites are going to look more important to Alexa. Let’s take a look at a graph comparing mattcutts.com and ask.com:

Matt vs. Ask!

For now, let’s concentrate on the green ellipse. This is a graph of reach, which is defined as “out of one million internet users, how many of them went to mattcutts.com vs. Ask each day.” If you look at the green ellipse, it shows that I had a spike in May and Ask had a dip in June. I believe Alexa was reporting that, on at least one good day for me and one bad day for Ask, I was reaching a larger share of internet users than Ask. (Alexa folks, please correct me if I’m mis-speaking or drawing the wrong conclusion.) And I believe that I can safely say that’s not remotely close to true. I have nowhere near the reach that Ask has. 🙂
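To make the webmaster bias concrete, here is a minimal sketch of how a toolbar panel could overstate a small site’s reach. Everything in it is an assumption for illustration: the function name, the true traffic shares, and the toolbar install rates are made up, and I don’t know how Alexa actually computes or normalizes reach.

```python
def panel_reach_per_million(true_share, install_rate_visitors, install_rate_others):
    """Reach per million users, as estimated from a toolbar panel.

    true_share: fraction of all internet users who visit the site that day.
    install_rate_visitors: fraction of the site's visitors who run the toolbar.
    install_rate_others: fraction of everyone else who runs the toolbar.
    """
    panel_visitors = true_share * install_rate_visitors
    panel_others = (1 - true_share) * install_rate_others
    # The panel only sees toolbar users, so reach is measured among them.
    return 1_000_000 * panel_visitors / (panel_visitors + panel_others)


# Hypothetical numbers, chosen only to illustrate the effect.
# A general-audience site: 1% of users visit, and its visitors install
# the toolbar at the same 1% rate as everyone else (no bias).
big_site = panel_reach_per_million(0.01, 0.01, 0.01)

# A webmaster-heavy blog: only 0.05% of users visit, but a quarter of
# those visitors run the toolbar.
geek_blog = panel_reach_per_million(0.0005, 0.25, 0.01)

print(round(big_site), round(geek_blog))
```

With these made-up inputs, the blog comes out ahead of the big site in the panel even though its true reach is 20 times smaller. The skew comes entirely from who installs the toolbar, not from who visits.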

I’m clearly getting some boost from webmaster bias because so many SEOs read my blog. Am I getting a boost from anything else? Well, look at the purple ellipse in the graph above. I got a really huge spike in reach around April 20th. Why? It’s not like I said anything especially insightful that week. I think the answer is that I’m getting a bit of geek boost too.

Others have noticed this impressive jump in late April, and that some non-geek sites remained unaffected. What on earth could account for this huge (but welcome) spike in my reach graph?

Jason Striegel proposed a possible explanation: maybe Digg did it. He suggests that a Digg story about Digg overtaking Slashdot in traffic caused a bunch of Diggers to install the Alexa toolbar–enough to skew Alexa’s stats. Now the Digg story was popular about a month before the Alexa spike–maybe there’s a near-one-month wait on accepting data from new Alexa toolbar installs? It’s hard to say, but that late-April spike is definitely interesting. I haven’t seen too many other theories on that boost for geeky sites. Anyone got other ideas?

Just to be clear: Alexa is wonderful in many ways, and I love Alexa. They provide easy access to nice usage data. You just have to keep in mind possible limitations, e.g. skewing due to sampling bias. And to be fair, I grabbed this Alexa graph a couple weeks ago: I went back today and the two “Matt vs. Ask” spikes don’t cross now. Maybe Alexa did some renormalization. That does raise the issue that any metric is a bit of a black box: you need to know the raw data used to compute a metric, and exactly how that metric is computed. If you don’t know that, then there are bounds to how confident you can be in a metric.

So how do you decide how much to trust a metric? One way is to find another similar metric and compare the two. For example, here’s a graph comparing reach for mattcutts.com to zawodny.com:

Matt vs. Jeremy

Ha ha! Looks like I’m trouncing him, eh? Time to do a little Google Dance? Not so fast. Let’s look at a completely different metric which should be comparable: Bloglines subscribers. My RSS feed lists 1,136 subscribers, while Jeremy lists 5,096 subscribers. So by that metric, Jeremy is destroying me. And I suspect that Bloglines subscriptions are more accurate in this case.

Now, are Bloglines subscriptions perfectly accurate? Of course not. People who talk a lot about RSS and APIs probably are more likely to have RSS subscribers, for example. Also, different feed readers will have different audiences and demographics. And I noticed that over my six-week vacation my Bloglines subscriber numbers didn’t budge. It’s probably true that even when web surfers visit a site less often, RSS subscriber numbers would remain nearly constant, because it’s more trouble to unsubscribe in most feed readers. So drops in popularity are probably more visible from web surfers than from RSS subscribers.

What are the takeaway points so far? You should think about the limitations in any methodology: bear in mind that sampling bias can under- (or over-!) represent a site, for example. To be completely sure of a metric, you need to know the raw incoming data and how the metric operates on that data to produce a number. And if you want to be more confident, look for similar metrics that should roughly agree. If different metrics agree, that’s a good sign. If they disagree, you should probably be cautious.
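The “do two metrics agree?” check can be sketched in a few lines. The Bloglines counts are the ones quoted above; the Alexa reach values are placeholders I invented, since the graph only shows that mattcutts.com sits higher than zawodny.com. The function name is hypothetical too.

```python
def metrics_agree(metric_a, metric_b, first, second):
    """Return True if both metrics rank `first` and `second` in the same order."""
    diff_a = metric_a[first] - metric_a[second]
    diff_b = metric_b[first] - metric_b[second]
    # Same sign on both differences means the same site wins under both metrics.
    return diff_a * diff_b > 0


# Subscriber counts quoted in the post.
bloglines = {"mattcutts.com": 1136, "zawodny.com": 5096}

# Placeholder reach values (assumption): only their order reflects the graph.
alexa_reach = {"mattcutts.com": 60, "zawodny.com": 30}

if metrics_agree(alexa_reach, bloglines, "mattcutts.com", "zawodny.com"):
    print("Metrics agree: a good sign.")
else:
    print("Metrics disagree: be cautious about both.")
```

This only compares the ordering of two sites, which is about as much as you can ask of two metrics measured in different units. A disagreement doesn’t tell you which metric is wrong, just that at least one of them deserves a closer look.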

Do you know the way to SES San Jose 2006?

Joe Morin has posted the official party list for Search Engine Strategies (SES) 2006 in San Jose. That includes Google Dance V at the Googleplex, which all SES attendees are invited to. It’s a great chance to see the Googleplex and enjoy music, food, drinks, and talking with Googlers.

At the conference itself, Danny Sullivan and Google CEO Eric Schmidt will sit down to talk at 10 a.m. on Wednesday, August 9th:

Google CEO Eric Schmidt talks with Search Engine Watch editor-in-chief Danny Sullivan about how the search leader is growing and dealing with challenges and issues in search.

Then at 11 a.m. that same day, I’ll be on a “Speaking Unofficially: Search Engine Bloggers” panel. Those are fun because it’s pretty much all Q&A, so I can leave my PowerPoint at the door. I may do some other sessions too, but San Jose is a great time to rotate in new Googlers and let them soak in the experience of chatting with webmasters, advertisers, publishers, and users.

I love the San Jose conference because I can talk to webmasters until late, then drive home and sleep in my own bed. 🙂 Who here will be at SES San Jose? I’d love to meet a few new folks this year.

Update: This always happens. As I get closer to a conference, I’m like “Oh, I forgot I promised I’d be on that panel.” So I’m planning on being at the Duplicate Content & Multiple Site Issues session on Tuesday, August 8th, at 11:15 a.m. On the plus side, I’ll just be there for Q&A (no PowerPoint again! woohoo!), and I’ll get to sit next to Tim Converse, the Yahoo! analogue of my webspam self. On the minus side, I won’t get to listen in on the Blog & Feed Search SEO or Search Arbitrage panels.
