Archives for December 2006

Page view metrics? Bah, humbug!

(A personal, non-work mini-rant about page view metrics.)

I want to come to Yahoo’s defense about something. A recent spate of reports says that Yahoo has been surpassed by various companies in terms of page views. Why is that relatively bogus? Because of Yahoo’s switch to AJAX for its mail. According to Alexa data, 49% of Yahoo visitors go to mail.yahoo.com. Everyone knows that I take Alexa data with a grain of salt, and that 49% fraction may be high, but Yahoo definitely gets a lot of traffic from Yahoo Mail. Yahoo’s new mail system uses AJAX. And how do the metrics companies handle AJAX? Typically, not well.

At SES San Jose recently, I asked a metrics company about how they count AJAX and the metrics person got a deer-in-the-headlights look on their face. What does your traffic look like if 30-50% of your page views are suddenly converted to AJAX where a page never really reloads? Your traffic doesn’t change and you may have happier and more users, but your metrics will plummet. By the way, that’s probably why you saw this post recently on the Yahoo! Anecdotal blog talking about page views.

If you think about it, you’ll see why AJAX breaks everything. Page views are “easy.” Take all your GETs, POSTs, HEADs, or whatever and add them up. Maybe do something smart about images/CSS/JS/framesets and everyone’s metrics will roughly agree. Now, what’s the relative value of a Maps AJAX request vs. a Gmail AJAX request? Do you see why your head will hurt if you try to come up with good metrics for an AJAX site that isn’t yours? By the way, Evan Williams pointed out months ago that Google Analytics can help you track your own AJAX applications. But that’s different: if you’re writing your own AJAX site, you know what events matter. But metrics companies won’t know which AJAX events should be counted unless they read your code carefully. And that’s assuming that the ISP they’re buying data from gives them AJAX requests in the first place.
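To make the counting problem concrete, here is a minimal sketch of a naive page view counter. Everything in it is made up for illustration: the request tuples, the paths, the `STATIC_SUFFIXES` list, and especially the `is_ajax` flag, which an outside metrics company often would not have at all.

```python
from collections import Counter

# Hypothetical list of suffixes treated as embedded resources, not pages.
STATIC_SUFFIXES = (".css", ".js", ".png", ".gif", ".jpg", ".ico")

def count_page_views(requests):
    """Bucket requests into page views, static resources, and AJAX calls.

    Each request is a (method, path, is_ajax) tuple. The is_ajax flag is an
    assumption made for this sketch; real traffic data may not expose it.
    """
    counts = Counter()
    for method, path, is_ajax in requests:
        if path.lower().endswith(STATIC_SUFFIXES):
            counts["static"] += 1
        elif is_ajax:
            counts["ajax"] += 1
        else:
            counts["page_views"] += 1
    return counts

# The same session (open the inbox, read three messages) on two mail clients.
classic = [
    ("GET", "/mail/inbox", False),
    ("GET", "/style.css", False),
    ("GET", "/mail/msg/1", False),
    ("GET", "/mail/msg/2", False),
    ("GET", "/mail/msg/3", False),
]
ajax = [
    ("GET", "/mail/inbox", False),
    ("GET", "/style.css", False),
    ("POST", "/mail/api/fetch", True),
    ("POST", "/mail/api/fetch", True),
    ("POST", "/mail/api/fetch", True),
]

classic_counts = count_page_views(classic)
ajax_counts = count_page_views(ajax)
print(classic_counts["page_views"], ajax_counts["page_views"])
```

The user did exactly the same thing in both sessions, but the page view count drops from four to one. Deciding whether each of those three AJAX calls "should" count as a view is the part nobody outside the site can answer well.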

What are the takeaways from this?
1. Remember that post that said Gmail had a 2.5% market share? Shortly afterwards, you started to see a “Google succeeded in search, but hasn’t done as well in other areas” meme. I wonder if we should reconsider the origin of that notion.
2. Everybody ask your favorite metrics company how they handle AJAX. Start with “exactly what level of AJAX data access do ISPs sell to you?”.
3. If you’re doing a start-up and want impressive page view metrics, stay the hell away from AJAX.
4. If you would even *for one second* consider staying away from AJAX for the sake of impressive metrics, you’re running your start-up ass-backwards.

Okay, I’m out of steam. If I had more steam, I’d rip into the idea of hours spent searching each month as a good metric for search engines. 🙂

Explaining algorithm updates and data refreshes

A thread on WMW started Dec. 20th asking whether there was an update, so I’m taking a break from wrapping presents for an ultra-quick answer: no, there wasn’t.

To answer in more detail, let’s review the definitions. You may want to review this post or re-watch this video (session #8 from my videos). I’ll try to summarize the gist in very few words though:

Algorithm update: Typically yields changes in the search results on the larger end of the spectrum. Algorithms can change at any time, but noticeable changes tend to be less frequent.

Data refresh: When data is refreshed within an existing algorithm. Changes are typically toward the less-impactful end of the spectrum, and are often so small that people don’t even notice. One of the smallest types of data refreshes is an:

Index update: When new indexing data is pushed out to data centers. From the summer of 2000 to the summer of 2003, index updates tended to happen about once a month. The resulting changes were called the Google Dance. The Google Dance occurred over the course of 6-8 days because each data center in turn had to be taken out of rotation and loaded with an entirely new web index, and that took time. In the summer of 2003 (the Google Dance called “Update Fritz”), Google switched to an index that was incrementally updated every day (or faster). Instead of a monolithic monthly event, Google would refresh some of its index pretty much every day, which generated much smaller day-to-day changes that some people called everflux.
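As a toy illustration of why a monolithic swap feels like a "dance" while incremental updates feel like everflux, here is a small sketch. It is not how any real search index works: the dict-based index, the URLs, the version strings, and the helper names are all hypothetical.

```python
def monolithic_update(new_crawl):
    """Build a complete replacement index, as in a monthly rebuild."""
    return dict(new_crawl)

def incremental_update(index, daily_changes):
    """Return a copy of the index with one day's changed documents applied."""
    updated = dict(index)
    updated.update(daily_changes)
    return updated

def changed_urls(before, after):
    """URLs whose stored content differs between two index snapshots."""
    return {u for u in before.keys() | after.keys()
            if before.get(u) != after.get(u)}

index = {"a.com": "v1", "b.com": "v1", "c.com": "v1", "d.com": "v1"}
new_crawl = {"a.com": "v2", "b.com": "v2", "c.com": "v2",
             "d.com": "v1", "e.com": "v1"}

# One big swap: everything that changed during the month shows up at once.
big_bang = changed_urls(index, monolithic_update(new_crawl))

# Daily refresh: only that day's changes show up.
day_one = changed_urls(index, incremental_update(index, {"a.com": "v2"}))

print(len(big_bang), len(day_one))
```

In the monolithic case, four URLs change in a single step; in the incremental case, only one changes on a given day, which is why searchers are far less likely to notice.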

Over the years, Google’s indexing has been streamlined, to the point where most regular people don’t even notice the index updating. As a result, the terms “everflux,” “Google Dance,” and “index update” are hardly ever used anymore (or they’re used incorrectly 🙂 ). Instead, most SEOs talk about algorithm updates or data updates/refreshes. Most data refreshes are index updates, although occasionally a data refresh will happen outside of the day-to-day index updates. For example, updated backlinks and PageRanks are made visible every 3-4 months.

Okay, here’s a pop quiz to see if you’ve been paying attention:

Q: True or false: an index update is a type of data refresh.
A: Of course an index update is a type of data refresh! Pay attention, I just said that 2-3 paragraphs ago. 🙂 Don’t get hung up on “update” vs. “refresh” since they’re basically the same thing. There’s algorithms, and the data that the algorithms work on. A large part of changing data is our index being updated.

I know for a fact that there haven’t been any major algorithm updates to our scoring in the last few days, and I believe the only data refreshes have been normal (index updates). So what are the people on WMW talking about? Here’s my best MEGO guess. Go re-watch this video. Listen to the part about “data refreshes on June 27th, July 27th, and August 17th 2006.” Somewhere on the web (can’t remember where, and it’s Christmas weekend and after midnight, so I’m not super-motivated to hunt down where I said it) in the last few months, I said to expect those (roughly monthly) updates to become more of a daily thing. That data refresh became more frequent (roughly daily instead of every 3-4 weeks or so) well over a month ago. My best guess is that any changes people are seeing are because that particular data is being refreshed more frequently.

Light blogging

My family arrived tonight from Kentucky to stay for the next several days, so the blogging will be verrrrrrry light for the rest of 2006.

Funny spam email

I first got to know Gary Stock because of his Googlewhack site (a “googlewhack” is a pair of words which, when typed into Google, return exactly one web page). As co-founder of Nexcerpt, he’s a good guy to know in general. Recently Gary got a spam email with my name in the subject field. It looked like this (reproduced with permission):

Subject: Google’s Matt Cutts talks about buying links.
Date: Thu, 7 Dec 2006 21:39:24 -0500
From: “Johns D. Gideon” <xxx @xxxxxxxxxx.xx.xx>
To: <xxxxxxxxx @xxxxxxxx.xxx>

Here’s a funny Google song.
AskJeeves has crawled thousands of pages, while indexing none of them.
Among the top five search engines, Microsoft Corp. Among the top five
search engines, Microsoft Corp.
The webmaster should add text links for the site navigation or he should
change the whole navigation from image links to text links. Getting
listed on Google is possible. The more clearly you structure your text,
the easier it is for search engines to process it.
He also said that the links to your web site should look natural.
There are dozens of web page factors that can influence your search
engine rankings. If you want to have rankings, your web site must have
both.
Back to table of contents – Visit Axandra. in claims to be the world’s
widest Web search. Google finds only one backlink to the web site.
Yahoo’s global usage share remains stable. What are paid links?
You then just have to wait until it comes out of the sandbox.
Google seeks to stop Microsoft from suing new hire. Whether you Googled
for Paris Hilton, a stock tip or a gift for Mom, you’v
Getting listed on Google is possible. What does this mean to your Google
rankings? They usually won’t bring you visitors that are interested in
what you have to offer.
Google to open research and development center in China. This
distribution partnership is probably only the start.
Has your web site been dropped from Google and you don’t know why? That
will help search engines to classify the page.
The whole web site doesn’t contain links to related web sites.
This week, we’re taking another look at keywords and search engine optimization.
You cannot see it at all in the picture above. Otherwise, they wouldn’t
be listed on the first result page.
Google doesn’t expect that new web sites have a large number of links.
But most of all, it wants to be successful.

You know you’re doing something right when you get included as gibberish in spam emails. 🙂

Call for Papers: AIRWeb 2007

I’m on the 2007 program committee for AIRWeb, which is the workshop on Adversarial Information Retrieval that will be held May 8th 2007, in conjunction with the WWW conference up in Banff. One big change from last year is a labeled testset that people can use for their webspam research. I’ll include the call for papers:

CALL FOR PAPERS
Third International Workshop on
Adversarial Information Retrieval on the Web
http://airweb.cse.lehigh.edu/2007/

-and-

CALL FOR CHALLENGE SUBMISSIONS
Track I of the Web Spam Challenge 2007
http://webspam.lip6.fr/

==================================================================
IMPORTANT DATES

15/Feb/2007 : Deadline for research articles
30/Mar/2007 : Deadline for challenge submissions
8/May/2007 : Workshop at the WWW 2007 conference in Banff, Canada
==================================================================

Contents:

1. AIRWeb’07 Topics
2. Web Spam Challenge
3. Timeline
4. Organizers and Program Committee

1. AIRWEB’07 TOPICS

Adversarial Information Retrieval addresses tasks such as gathering,
indexing, filtering, retrieving and ranking information from collections
wherein a subset has been manipulated maliciously. On the Web, the
predominant form of such manipulation is “search engine spamming” or
spamdexing, i.e., malicious attempts to influence the outcome of ranking
algorithms, aimed at getting an undeserved high ranking for some items
in the collection.

We solicit both full and short papers on any aspect of adversarial
information retrieval on the Web. Particular areas of interest include,
but are not limited to:

* Link spam
* Content spam
* Cloaking
* Comment spam
* Spam-oriented blogging
* Click fraud detection
* Reverse engineering of ranking algorithms
* Web content filtering
* Advertisement blocking
* Stealth crawling
* Malicious tagging

Proceedings of the workshop will be included in the ACM Digital Library.
Full papers are limited to 8 pages; work-in-progress will be permitted 4
pages.

For more information, see http://airweb.cse.lehigh.edu/2007/

2. WEB SPAM CHALLENGE

This year, we are introducing a novel element: a Web Spam Challenge for
testing web spam detection systems. We will be using the WEBSPAM-UK2006
collection for Web Spam Detection http://www.yr-bcn.es/webspam

The collection includes a large set of web pages, a web graph, and
human-provided labels for a set of hosts. We will also provide a set of
features extracted from the contents and links in the collection, which
may be used by the participant teams in addition to any automatic
technique they choose to use.

We ask that participants of the Web Spam Challenge submit predictions
(normal/spam) for all unlabeled hosts in the collection. Predictions
will be evaluated and results will be announced at the AIRWeb 2007
workshop.

For more information, see http://webspam.lip6.fr/

3. TIMELINE

– 7 February 2007: E-mail intention to submit a workshop paper
(optional, but helpful)
– 15 February 2007: Deadline for workshop paper submissions
– 15 March 2007: Notification of acceptance of workshop papers
– 30 March 2007: Camera-ready copy due
– 30 March 2007: Challenge submissions due
– 8 May 2007: Date of workshop

4. ORGANIZERS AND PROGRAM COMMITTEE

Organizers

– Carlos Castillo, Yahoo! Research
– Kumar Chellapilla, Microsoft Live Labs
– Brian D. Davison, Lehigh University

Program Committee

– Einat Amitay, IBM Research
– Andras Benczur, Hungarian Academy of Sciences
– Andrei Broder, Yahoo! Research
– Soumen Chakrabarti, Indian Institute of Technology Bombay
– Paul-Alexandru Chirita, University of Hannover
– Tim Converse, Yahoo!
– Nick Craswell, Microsoft Research
– Matt Cutts, Google
– Ludovic Denoyer, University Paris 6
– Aaron D’Souza, Google
– Dennis Fetterly, Microsoft Research
– Tim Finin, University of Maryland
– Edel Garcia, Mi Islita.com
– Natalie Glance, Nielsen BuzzMetrics
– Antonio Gulli, Ask.com
– Zoltan Gyongyi, Stanford University
– Monika Henzinger, Google & Ecole Polytechnique Federale de Lausanne (EPFL)
– Jeremy Hylton, Google
– Ronny Lempel, IBM Research
– Mark Manasse, Microsoft Research
– Gilad Mishne, University of Amsterdam
– Marc Najork, Microsoft Research
– Jan Pedersen, Yahoo!
– Tamas Sarlos, Hungarian Academy of Sciences
– Erik Selberg, Microsoft Search Labs
– Mike Thelwall, University of Wolverhampton
– Andrew Tomkins, Yahoo! Research
– Matt Wells, Gigablast
– Baoning Wu, Lehigh University
– Tao Yang, Ask.com
