Archive for July, 2006

A word about metrics, part I

I’ve been reading the brouhaha about Hitwise’s press release about MySpace and Yahoo!, and I wanted to talk about metrics a bit.

Let me tell an Old Timey Story. When I joined in 2000, Google was a scrappy underdog search engine. Back then, Altavista was vastly more popular and reported 50 million searches a day. Google was popular among savvy webmasters and at many universities, and usage was growing quickly by word-of-mouth, but the smart folks at Google were eager for the company to be better known. The metrics services of the day vastly underrated the number of searches done on Google every day. Month after month, every report seemed to show that Google had a tiny share of the market.

At some point, one of the metrics services (which shall remain nameless) came to Google so that we could try to reconcile our data with their claims. I wasn’t in the meeting, so afterwards I caught an engineer and asked what happened; why did our numbers differ by so much? “They solicit people to install an application for them” was the answer. “But that’s a horrible methodology!” I said. “That would get you a ton more novice users; expert users wouldn’t see the value and probably wouldn’t install the application as much.” The other engineer agreed.

That was an eye-opener for me. At the time, Google was much more popular with highly-technical users, who were less likely to show up in that metric. So while Google gained market share, that particular methodology always lagged in showing Google’s growth. In a way, it was a blessing in disguise: if competitors took the metrics at face value, they would underestimate Google and how fast it was growing. Ever since, I’ve taken every metric with a grain of salt--you have to think about underlying assumptions and limitations in the data.
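To make the effect concrete, here is a minimal sketch in Python of how a panel that over-recruits novice users can understate a search engine's share. Every number below is invented purely for illustration; none of it is real data from Google or from any metrics service, and the function name `share` is just a helper made up for this example.

```python
def share(expert_frac, expert_rate, novice_rate):
    """Overall share of searches for one engine in a group of users.

    expert_frac: fraction of the group that are expert users
    expert_rate: the engine's share of searches among experts
    novice_rate: the engine's share of searches among novices
    """
    novice_frac = 1.0 - expert_frac
    return expert_frac * expert_rate + novice_frac * novice_rate

# Hypothetical numbers, chosen only to illustrate the effect.
EXPERT_RATE = 0.60  # engine's share among expert users
NOVICE_RATE = 0.10  # engine's share among novice users

# Assume 30% of all searchers are experts, but only 5% of the panel,
# because experts are less likely to install the tracking application.
true_share = share(expert_frac=0.30, expert_rate=EXPERT_RATE, novice_rate=NOVICE_RATE)
panel_share = share(expert_frac=0.05, expert_rate=EXPERT_RATE, novice_rate=NOVICE_RATE)

print(f"true share:  {true_share:.1%}")
print(f"panel share: {panel_share:.1%}")
```

With these made-up inputs the panel reports roughly half of the true share, even though every individual measurement in the panel is accurate. The bias comes entirely from who ends up in the sample.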

Let’s do a simple exercise to see if you’ve been paying attention. Suppose someone calls you up on the phone to ask you to record what you’ve been watching on TV. “How did you choose me?” you ask. The caller says, “Oh, we go by the last four digits of your phone number.” Now, what limitations will there be in the data? People without phones will be left out in the cold. People who have two phone lines are more likely to get a call. And someone who ditched their landline for a cell phone might not get a call. That will absolutely skew the selection of people unless the group doing the survey makes special efforts.

Ah, writing down what TV you watch isn’t accurate anyway, you say. Let’s buy metrics data from TiVo--they can pinpoint exactly what their users watch! Well, where’s the flaw in that? Does everyone have a TiVo? No way! TiVo viewers skew toward the hip and smart (and moneyed). Plus some providers (Cox? Comcast? DirecTV? Dish?) may not use TiVo as much because they offer their own DVR. So TiVo’s data is biased too.

Now that you’re appropriately jaded and cynical, let’s look at something out there right now. Here’s a recent post that appeared on Podcasting News:

Nielsen: Podcasts More Popular than Blogging
July 12, 2006

Nielsen//NetRatings announced today that 6.6 percent of the U.S. adult online population, or 9.2 million Web users, have recently downloaded an audio podcast. 4.0 percent, or 5.6 million Web users, have recently downloaded a video podcast.

These figures put the podcasting population on a par with those who publish blogs, 4.8 percent ...

Okay, if you think more people podcast than blog, raise your hand. Anyone? No? The thing to notice is that Podcasting News contrasted downloaders of podcasts with producers of blogs. The headline may be technically correct, but it probably would not be if it read “Podcasting More Popular than Blogging” (notice how I turned “podcasts” into a verb?). Yet that article was at the top of Techmeme, and your average reader could easily miss the distinction.

The story has a happy ending. I went back this morning to check if it was still on Techmeme, and Scoble and another podcasting site are calling people on it. In the instance above, the Nielsen numbers may have been completely accurate, but you still have to analyze how someone takes those numbers and think critically about what claims they make (or imply).

This is long enough and I haven’t even *begun* to talk about Alexa or Hitwise, so let’s split it up. Today is Meeting Galore Day, but there will be at least a part II.

Comments (81)

Things I will never do

So I’m back at Google after about six weeks off. That probably sounds like a lot to Silicon Valley folks; it’s hard to believe that in high school we would get a break that was twice that long every summer.

I managed to knock out a lot of things I’ve been meaning to do for a while. I got the car in good shape. Saw an eye doctor and a regular doctor. Got my cholesterol checked. Turns out I’ve got more than enough of it. :) I organized my computer cables, although they’re not as tidy as Danny likes his. I read a ton of books.

What’s more interesting to me is what I didn’t do. My current theory is that even if I were given a near-infinite amount of free time, I probably still wouldn’t:

- Read a book about financial planning or change my checking account to collect interest.
- Decorate my office.
- Learn to play the guitar.
- Or play the piano.
- Or sing.
- Play golf. Or garden.
- Learn to fly a plane.
- Learn to type on a Dvorak keyboard.

One thing I did finish, which surprised me, is that I finally pulled data off all the hard drives that have been collecting dust:

[Photos: Fleet of hard drives! Attack of hard drive aliens!]

Just in case you weren’t sure if I’m a geek: yes, I am. I highly recommend this AMS Venus hard drive enclosure. It’s got Firewire and USB ports. And if you pop the case off, you get a nice little tray with data and power cables:

[Photo: Hard drive enclosure]

It’s practically a mini-assembly line if you need to swap out different hard drives quickly. Then combine the AMS hard drive enclosure with Ubuntu Linux and it will read everything: vfat (Win95), NTFS (Windows NT), ext2/3 (Linux partitions). It will slurp through it all with a minimum of effort.

Comments (65)

Review: Ubuntu 6.06 (“Dapper Drake”)

I accidentally tripped and fell yesterday. My hand hit the keyboard along the way, and I accidentally installed Ubuntu on the way down. That’s how easy Ubuntu was to install. It was pretty much the geek equivalent of falling off a log.

Okay, not really, but it was almost that easy. Do you know what my hardest problem was? One minute I was updating security packages after the install, and the next minute the computer was off. “That sucks,” I thought. “Maybe Ubuntu isn’t ready for prime time yet.” Then I noticed that several other things had lost power, including another computer.

And I’m trying to figure out what happened when a sheepish cat walks out from behind the computers. My cat Emmy unplugged a power strip. She does stuff like this. She likes to get into boxes:
[Photo: Emmy in a box]
And any other nook she can find:
[Photo: Emmy in a nook]
After I figured out that Emmy unplugged my computer, it was a pretty simple install.

Nice Ubuntu surprises:
- Detected an ethernet connection + DHCP and hopped onto the network with no problem. Nice detection of USB keyboard/mouse, and it worked to have both a PS/2 and a USB mouse attached at the same time.
- Easily found other computers on the network, including network-attached file servers running Samba.
- When I plugged in an external USB hard drive, it automatically detected and mounted it as “usbdrive,” which is great. And if the hard drive has three partitions, Ubuntu will open up one window for each partition.
- The update manager is really slick. It tells you when to download a security update to a package, and makes it really easy.
- Plenty of command-line binaries (e.g. shred), yet the menus are nicely streamlined.

Some Ubuntu issues:
- Not fully cat-proof. Yanking power to the computer in the middle of updating packages can cause problems.
- Video card/screen resolution detection wasn’t perfect. I’ve got a 24″ Dell screen at home, but the highest screen resolution I was offered was 1024 x 768.

My former Linux machine at home was a Libranet install (much respect to Jon Danzig, who died last year, plus his son Tal and the Libranet team for an awesome Linux distribution). Libranet had several ways to tweak fonts and settings, but it wasn’t fall-off-a-log easy. But after you install Ubuntu, just give EasyUbuntu a try. It installs additional software for things like MP3 and video codecs, nice fonts, and installs binary Nvidia/ATI drivers too.

Recently I’ve been using Ubuntu to pull data off all the hard drives I’ve collected over the years and to put the data in one place. I’ve been pleasantly surprised by how polished the distribution is; it’s even nicer than it was a few months ago. If you’ve got a computer you can play around with, I’d give Ubuntu a try.

Comments (69)

Catching up

I’m still way behind on my email and blogreading, but I’ll go ahead and mention 2-3 things that I’ve come up to speed on.

I was happy to see that by the time I heard of some issues, they were already resolved. On June 16th, Matt Mullenweg posted that he’d been banned from Google. Happily, Matt kept updating the status as he learned more. It turns out that someone had uncovered Matt’s password by scouring the source code for a new project Matt was working on. The bad guy flipped on a privacy feature on Matt’s blog that added a “noindex” meta tag. And we know what the noindex tag does. When Matt figured this out, he removed the noindex tag and he’s back in Google now. In general, if your server is down for a few days and Googlebot can’t crawl your pages, those pages can drop out of our index. But when the pages are alive again, Google will often find the pages quickly and you should usually return to where you were before.

Ruslan Abuzant noticed what looked like a fragment of a server status page. He posted over at Digital Point Forums, and people there debated if the fragment was real or not. Yes, it was real. No, I’m not going to comment on what any of it means. :) Folks have taken steps to keep it from happening in the future, but personally, I think that we need to start including some extra settings for fun. I’d say that we should add

--initial_time_travel_wormhole="Wednesday, December 31 1969 11:11 pm"
--use_googlepray=false
--docid_size=more-than-four-bytes
--SETI_alien_communication_port=31337
--skynet_sentience=0.33
--plane_load=snakes
--pigeonrank_seed=42
--use_mentalplex=true
--unicorn_versus_werewolf=its-on-now

Let’s see, what else. When I saw the obligatory “Google found data we didn’t want indexed” article that I missed while I was gone, I almost didn’t bother to ask around. Barry covered this story pretty well when he noted that Googlebot doesn’t go around guessing passwords. I assumed that someone left the information lying out somehow, or that there was a hyperlink out on the web with a username and password embedded in it. When I had a chance to talk to a colleague back at Google, I got a little more info though. He said he didn’t mind if I reprinted what he found:

The URL was on a server that this school district thought was password-protected. Before they took down the server, I was able to retrieve the live URL. I was getting a username/password login page with a regular Firefox user agent, but I got a server error when I changed the UA as Googlebot. I changed back to Firefox and was able to retrieve the username/password page again. It seems their document system was cloaking to Gbot, likely unbeknownst to the people who are writing us now and requesting the removal.

That’s my best guess for how the information got into Google. Of course it’s a moot point now because the urls are no longer in Google. But that’s what prompted me to write a short/sweet “How to herd Googlebot” post. If you administer a web server which has information that you don’t want to be public, it’s easier to exclude content in advance than to try to remove it from search engines later.

Again, I’m still catching up, but I’m planning to discuss at least a couple more things that I missed while I was on vacation.

Comments (65)

Bot Obedience: Herding Googlebot

I noticed a useful session at the upcoming Search Engine Strategies conference in San Jose. In exactly a month there will be a Bot Obedience class. People sometimes ask me about how to “sculpt” where Googlebot visits, and my only other post about this was pretty technical, so I’ll take a stab at a shorter, clearer post.

At a site or directory level, I recommend an .htaccess file to add password protection to part of a domain. I wrote a quick example of setting up an .htaccess file about this time last year. I’m not aware of any bot (including Googlebot) that guesses passwords, so this is quite effective at keeping content out of search engines.

At a site or directory level, I also recommend a robots.txt file. Google provides a simple robots.txt checking tool to test out files before putting them live.
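If you want a quick local sanity check before putting a robots.txt file live, here is a minimal sketch using Python's standard `urllib.robotparser` module. This is not Google's checking tool, and Python's parser may not interpret every rule exactly the way Googlebot does. The robots.txt contents, the paths, and the `example.com` URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep every crawler out of /private/,
# and keep Googlebot out of /drafts/ as well.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/
Disallow: /drafts/

User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Check a few (made-up) URLs against the rules above.
for agent, url in [
    ("Googlebot", "http://example.com/private/report.html"),
    ("Googlebot", "http://example.com/drafts/post.html"),
    ("SomeOtherBot", "http://example.com/drafts/post.html"),
    ("Googlebot", "http://example.com/index.html"),
]:
    print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Keep in mind that robots.txt only asks well-behaved crawlers to stay away. It does not protect the content the way a password does.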

At a page level, use meta tags at the top of your html page. The noindex meta tag will keep a page from showing up in Google’s index at all. This tag is great on any page that’s confidential. The nofollow meta tag will prevent Googlebot from following any outgoing links from a page. This page shows the proper syntax.
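As an illustration, a robots meta tag in the head of a page looks like `<meta name="robots" content="noindex, nofollow">`. The following is a minimal sketch in Python that checks which of these directives a page actually carries, using the standard `html.parser` module. The class name, the helper function, and the sample page are made up for this example.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, nofollow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )

def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives

# Hypothetical confidential page.
PAGE = """<html><head>
<title>Confidential</title>
<meta name="robots" content="noindex, nofollow">
</head><body>...</body></html>"""

print(sorted(robots_directives(PAGE)))
```

A check like this could be run over every page that is supposed to stay out of the index, so that a template change that drops the tag gets noticed.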

At a link level, you can add a nofollow attribute (rel="nofollow") to individual links to prevent Googlebot from following them (you could also make the link redirect through a page that is forbidden by robots.txt). Bear in mind that if other pages link to a url, Googlebot may find the url through those other paths. If you can, I’d recommend using .htaccess or robots.txt (at a directory level) or meta tags (at a page level) to be safe. I’ve seen people try to sculpt Googlebot visits at the link level, and they always seem to forget and miss a few links.
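Since missed links are the usual failure at the link level, here is a minimal sketch in Python of an audit that lists links to a given path that lack rel="nofollow". It uses the standard `html.parser` module. The class name, the helper function, the `/login/` prefix, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser

class LinkAuditor(HTMLParser):
    """Finds <a href> links under a path prefix that lack rel="nofollow"."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href") or ""
        # rel can hold several space-separated values, e.g. "external nofollow".
        rel_values = (attr.get("rel") or "").lower().split()
        if href.startswith(self.prefix) and "nofollow" not in rel_values:
            self.missing.append(href)

def links_missing_nofollow(html, prefix):
    """Return hrefs starting with prefix whose link has no rel="nofollow"."""
    auditor = LinkAuditor(prefix)
    auditor.feed(html)
    return auditor.missing

# Hypothetical page fragment: one /login/ link was forgotten.
PAGE = """
<a href="/login/" rel="nofollow">Log in</a>
<a href="/login/reset">Forgot password?</a>
<a href="/about/">About</a>
<a href="/login/help" rel="external nofollow">Help</a>
"""

print(links_missing_nofollow(PAGE, "/login/"))
```

Even with an audit like this, links from other sites are outside your control, which is why the directory-level and page-level approaches are the safer choice.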

If the content has already been crawled, you can use our url removal tool. This should be your last resort; it’s much easier to prevent us from crawling than to remove content afterwards (plus the content will be removed for six months). This help page discusses how to remove other types of content from Google.

Update: Vanessa Fox pointed out this Googlebot help page which covers a ton of other Googlebot questions.

Comments (73)
