Archives for February 2007

Heading to SES London

I’m getting ready to head to SES London 2007 soon. I hope to see lots of search folks there! My wife and father-in-law are coming too, which is practically a recipe for interesting hijinks. 🙂

I’ve been trying to get on a London sleep schedule by getting up earlier and earlier each morning, and varying my caffeine dosage. The day of the flight, I’m planning to wake up at 3 a.m. (11 a.m. London time) and go the whole day without caffeine. If I can sleep on the plane, I should wake up bright-eyed and bushy-tailed in London. Either it will work out brilliantly, or my sleep schedule will crash and burn. I’ll let you know how it goes. I’m also really psyched about visiting the Google Dublin office (and Ireland, for that matter) for the first time. 🙂

If you think SES London might be fun, there’s still time to register.

Update: And I’m in London. I got up at 4:30 a.m. instead of 3 a.m., but my wacky get-up-early, sleep-on-the-plane, slam-a-Red-Bull-in-London plan actually worked. I was up till midnight GMT and now I need to worry about not getting up too late. 🙂 Sunday we walked around, stumbled on the BAFTAs in Covent Garden (congrats, Helen Mirren!), and had a good time. Today I plan to catch up on work a bit and do touristy things before the conference starts tomorrow.

Review: Yahoo Pipes

I’ll chime in that I think Yahoo Pipes is a really neat idea. As every decent UNIXhead knows, pipes let you combine small command-line tools easily by routing the output of one tool into the input of another tool. For example, “cat census-names | cut -d',' -f2 | sort | uniq -c | sort -rg” might take a list of people’s names, extract just the last names, sort the list, unique-ify the list and produce a count of how many times each name occurred, then sort by the biggest number. Voilà, from a raw list of names you’ve now got a list of the most popular last names. Pipes are one of the things that make Linux/Unix roxor.
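For anyone who doesn’t live at a shell prompt, here is a rough Python sketch of what that pipeline computes. It assumes (as the cut command does) that each line is comma-separated with the last name in the second field; the function name is made up for illustration.

```python
from collections import Counter


def count_last_names(lines):
    """Roughly mimic: cut -d',' -f2 | sort | uniq -c | sort -rg"""
    last_names = []
    for line in lines:
        fields = line.rstrip("\n").split(",")
        # Unlike cut, this sketch skips lines that have no second field.
        if len(fields) >= 2:
            last_names.append(fields[1].strip())
    # most_common() returns (name, count) pairs, biggest count first.
    return Counter(last_names).most_common()


if __name__ == "__main__":
    sample = ["John,Smith", "Ann,Jones", "Bob,Smith"]
    for name, count in count_last_names(sample):
        print(count, name)
```

The shell version is shorter, which is sort of the point: each small tool does one job and the pipe does the plumbing.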

The idea of Yahoo Pipes, as far as I can tell from a quick look, is to allow that same pipe behavior on RSS feeds. The system also outputs RSS urls. There are operators like sorting, counting, truncating, etc. If you wanted to make a mash-up of different feeds, this would be a neat way to prototype it.

What’s an example use case? Well, I’ve got one right here in my back pocket. 🙂 Suppose you had multiple RSS calendar feeds, and wanted to combine those multiple feeds into one feed. Yahoo Pipes has a “union” operator, so I’m assuming you could pipe in (say) four feeds and get back one feed that was a superset of all the items that you fed in. A couple months ago, I went searching for a package that did set operations on RSS feeds (specifically looking to combine multiple different calendar feeds), and didn’t really see anything great at that point. This tool would solve that problem.
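To make the “union” idea concrete, here is a minimal sketch of that set operation in Python using only the standard library. This is not how Yahoo Pipes does it (I have no idea how they implement it); the function name is hypothetical, and I’m assuming RSS 2.0 feeds where each item can be identified by its guid, link, or title.

```python
import xml.etree.ElementTree as ET


def union_feeds(feed_xml_strings, title="Combined feed"):
    """Merge the items of several RSS 2.0 feeds into one feed, dropping duplicates."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title

    seen = set()
    for xml_text in feed_xml_strings:
        root = ET.fromstring(xml_text)
        for item in root.iter("item"):
            # Assumption: guid, else link, else title identifies an item.
            key = (item.findtext("guid")
                   or item.findtext("link")
                   or item.findtext("title"))
            if key and key in seen:
                continue  # already emitted by an earlier feed
            if key:
                seen.add(key)
            channel.append(item)

    return ET.tostring(rss, encoding="unicode")
```

A real feed would also want a channel link and description, and calendar feeds would probably want sorting by date, but the core of a union is just “append everything, skip what you’ve already seen.”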

I took it for a test drive right after Jeremy posted about it and hit an error message partway through, but I’m sure they’ll get it smoothed over pretty quickly. I was able to save a “module” (which appears to be a little chunk of pipe processing which is connected to an RSS output). Then you can click publish and you get an obfuscated url back. I tested the obfuscated url and it generates RSS just fine; the test module I made is safely tucked into Google Reader now. This is the closest to fun that RSS has ever been for me, but my eyes glaze over at the sight of XML. 😉

It isn’t terribly hard to do operations on RSS feeds, but Yahoo Pipes
– has a fun UI for playing with slicing and dicing feeds
– doesn’t require a ton of work to save/publish modules (you do need a Yahoo ID though)
– produces easy-peasy urls that output RSS, so those urls could in turn be used by other people or in larger modules. Bonus geek points for staying true to the Unix pipe idea in that way.

O’Reilly has a good write-up as well. Congrats to the Yahoo folks that built Pipes. Nice stuff. 🙂

Google provides backlink tool for site owners

One of the common requests I hear from webmasters is “Why doesn’t Google show me most or all of my backlinks?” Well, as of today, Google’s webmaster console now lets you see your site’s backlinks. Major props to the webmaster console team for this new feature. A few things to know:

– The backlink tool doesn’t show 100% of the backlinks from Google yet, but I expect the number of links that are available to grow.
– In particular, for my site I was easily able to see more than 10x as many links in this new tool as the link: command gave me. The link: command has always returned a small fraction of the backlinks that Google knows about, mainly for historical reasons (e.g. limited disk space on the machines that served up “link:” data).
– You can download the backlinks in a really nice CSV format, suitable for slicing and dicing and other analysis. I believe you can export up to a million backlinks if your site has that many. 🙂
– Do not assume just because you see a backlink that it’s carrying weight. I’m going to say that again: Do not assume just because you see a backlink that it’s carrying weight. Sometime in the next year, someone will say “But I saw an insert-link-fad-here backlink show up in Google’s backlink tool, so it must count. Right?” And then I’ll point them back here, where I say do not assume just because you see a backlink that it’s carrying weight. 🙂
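As an example of the slicing and dicing that a CSV export makes possible, here is a small Python sketch that counts how many backlinks come from each linking host. The column name “source_url” is purely my assumption for illustration; check the header row of the file you actually download and adjust it.

```python
import csv
import io
from collections import Counter
from urllib.parse import urlsplit


def linking_hosts(csv_text, url_column="source_url"):
    """Count backlinks per linking host in a CSV export.

    url_column is a hypothetical column name, not the tool's documented format.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    hosts = Counter()
    for row in reader:
        host = urlsplit(row[url_column]).hostname
        if host:
            hosts[host] += 1
    # (host, count) pairs, most links first.
    return hosts.most_common()
```

And of course a host showing up at the top of that list says nothing about whether those links carry weight.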

I’m sure that there was more that I wanted to say, but why don’t people start playing with it and give feedback or post backlink tool-related questions? I know that the webmaster team reads the feedback over here too; congrats again to that entire team for providing this. If you want to start browsing your site’s backlinks, sign up for Google’s webmaster console now.

Quick February hits

The website for this year’s Superbowl stadium was recently hacked so that visitors to that page with unpatched Windows computers downloaded a keylogger and malware that allows full backdoor control. It’s fixed now, but it’s a good idea to make sure whatever computer you use to browse is patched. For webmasters, hacked sites are shaping up to be an issue that everyone will have to be savvy about. Try to make sure your servers are patched and that you have backups.

Philipp Lenssen had a good post with his predictions for future generations of search. Among the predictions are:
– combining data sources from web pages to satellite imagery to speech-to-text transcriptions
– personalization that works really well
– better results from deeper AI

Philipp was proven prescient a couple hours later, when Google mentioned that it will provide personalized search results for signed-in users. You can always sign out to get the “generic” results, but I think users will see a nice win from personalization down the road.

Strangely enough, Gord was also taking the psychic pills recently. Check out his post about personalization, and especially his interesting take on what 2007 will bring from different engines in the area of personalization.

Next, I missed this the first time around, but JLH did a neat post graphing the usage of Google’s webmaster discussion group. As I understand it, the intent of this group is to allow users to help other users. Googlers chime in on the group to give guidance on some topics, but one of the big benefits is the ability to get suggestions from other users and feedback on different topics.

Another interesting url is Feed the Bot. The footer notes “This website is not affiliated with Google Inc.” but I was impressed by the way that the site explores Google’s webmaster guidelines and tries to explain them in depth.

Finally, if you’ve missed my smiling mug on videos recently, I did an interview with Patrick Norton (he of the Screensavers show) on DL.TV. DL.TV stands for Digital Life and it’s a weekly video show on the web that’s custom-made for tech fans. Update: This DL.TV episode has some more tips from me.

Patrick asked for power-user tips for searching Google, so I brainstormed for a half-hour before we taped. There were enough tips that they stuck to Q&A for this video and they’ll include the actual search tips in the next show. At any rate, enjoy. 🙂

Oh, and I updated the “ways people can examine what you’re up to” part of this previous post. I added the text “Update: Some people are even willing to go through the internal text files that you provide for translation/localization.”

Better click tracking with Auto-tagging

Okay, I’m curious about something. When Google wrote a 17-page white paper about flaws in click fraud studies, how many people here read it from start to finish? If you didn’t get a chance to read it back then, you’re in luck. Shuman Ghosemajumder, a product manager at Google, summarizes the high-order bits in two posts, here and here. The two paragraphs that stood out to me were:

Here’s the problem: web logs, whether generated by an advertiser, or by third-party code on an advertiser’s site, cannot directly track ad clicks. Instead, they track visits to a special landing page URL on the advertiser’s site (e.g. ) as a proxy for how many ad clicks occurred. The assumption they’re relying upon is that each visit to that URL corresponds to a unique click, and vice versa. But in practice this is not the case. Once a user visits that page, they often browse through the site, navigating through sub pages, and then return to the original landing page by hitting the back button. When the landing page is reloaded in the browser, it appears in the web log as though additional ad “clicks” are occurring. Google can count ad clicks reliably as a click on a Google ad will cause the web browser to contact Google and then we redirect it to the advertiser’s landing page. A reload of the advertiser’s landing page does not contact Google again. In addition, the referrer URL which is passed by the browser when users hit the back button is actually the original referrer URL (which says the page came from an ad click) which gets cached, so there is no analysis which can be done based on logs alone which can resolve this. This is where the fictitious clicks come from. ….

So is there a solution to this? Yes. Third-party analytics (not click fraud) firms have been aware of the page reload issue for many years, and generally use redirects (rather than web log based tracking) to avoid it. If one is tied to using web site logs (or landing page code generating logs) however, the only solution is to use the AdWords auto-tagging feature. Auto-tagging has been available since 2005, and is a feature which appends a unique ID to the landing page URL for every click, so that the cases of (a) multiple clicks and (b) multiple reloads of the landing page can be easily distinguished.

I think Shuman did a really good job summarizing that logs alone can’t be accurate. To help me visualize it, I tried to draw a picture:

[Diagram: path of clicks with Auto-tagging turned on]

In my diagram, a user does the following
A) clicks on a Google ad and arrives at an advertiser’s landing page
B) hits the reload button
C) navigates to a different page
D) hits the back button

Please pardon my utter lack of artistic skills. 🙂 If I’m reading Shuman’s post correctly, events A (the click on an ad), B (reloading the page), and D (hitting the back button) can show up in logs as accesses to the landing page. Because in the logs those accesses look like ad clicks, it might look like one IP address is clicking an ad three times.

So how can you tell real ad clicks from reloads/back-button events? Use Auto-tagging, which is a feature that Google has offered since 2005 and that I don’t think any other major search engine offers. What does Auto-tagging do? Every ad click from Google gets tagged with a unique id. So if you turned on Auto-tagging, an ad click to your landing page would arrive with a unique “gclid” parameter appended to the landing page url.

Want to know how many unique ad clicks were delivered to your site by Google? Just count the unique gclid parameters. And if I see the unique id “COasyKJXyYECFRlvMAodRFXJ” show up three times in my log, I know that Google charges me at most once for that unique id (they mention that in the 17-page white paper). I hope Shuman’s post or the diagram above makes it clear that just counting accesses to your ad landing pages in your logs will never give an accurate ad-click count. For example, studies in the 1990s found that the back button accounted for 30-40% of all navigation events. If you turn on Auto-tagging (which is enabled by default when you link your AdWords account with Google Analytics, or you can turn it on without signing up for Analytics), then you don’t need to worry about reloads or the back button (or opening new windows in IE).
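If you’d rather see the counting spelled out, here is a minimal Python sketch that compares the naive count (every access to the landing page) with the count of unique gclid values. The example.com urls, the “/landing” path, and the function name are all hypothetical; the only things taken from above are the gclid parameter name and the sample id. A real log would need its own parsing to pull out the requested url first.

```python
from urllib.parse import urlsplit, parse_qs


def count_clicks(logged_urls, landing_path="/landing"):
    """Return (naive_count, unique_gclid_count) for a list of requested URLs."""
    naive = 0
    gclids = set()
    for url in logged_urls:
        parts = urlsplit(url)
        if parts.path != landing_path:
            continue  # not a landing page access
        naive += 1
        for gclid in parse_qs(parts.query).get("gclid", []):
            gclids.add(gclid)
    return naive, len(gclids)


if __name__ == "__main__":
    tagged = "http://www.example.com/landing?gclid=COasyKJXyYECFRlvMAodRFXJ"
    log = [
        tagged,                              # A) ad click
        tagged,                              # B) reload
        "http://www.example.com/products",   # C) navigates elsewhere
        tagged,                              # D) back button
    ]
    print(count_clicks(log))  # (3, 1)
```

Three landing page accesses, one unique id: that is the A/B/D situation from the diagram, and the unique-id count is the one to compare against what you were charged for.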

I’m happy to add the disclaimer that I work on webspam in the search quality group, so I’m not an expert on pay-per-click advertising or invalid clicks. If I’ve said anything incorrect in this post, let me know and I’ll happily correct it. But if you’re using AdWords, I would definitely recommend turning on Auto-tagging.

By the way, if this post was at all interesting, I’d recommend checking out that white paper (pdf link). This time start on page 12 instead of page 1. 🙂

Update: A good post over at the AdWords blog provides actionable information about exactly how to report suspicious traffic, as well as some answers to common questions/concerns.