Archives for June 2009

Watch my site review session from Google I/O

At Google I/O a few weeks ago I did a site review session with fellow Google colleagues Brian White and Greg Grothaus. The video from that session is live now and I’ll include it below:

About 38 minutes in, the session morphed into a general Q&A. So even if you don’t care about site reviews, the Q&A might be interesting to you. Videos aren’t perfect (for example, it’s much harder for someone watching a video to skim quickly). But I love that I can do a two-minute video by just talking for two minutes. :) Compare that to any blog post, which seems to take me at least an hour.

P.S. If you like this session, you might be interested to know that most Google I/O sessions were recorded and are available on video. For example, one of my favorite sessions was watching Aaron Boodman (author of Greasemonkey) talk about how to write extensions for Chrome. The amount of information available from the full session list is pretty amazing. That’s not even counting the Google Wave announcement, which has been viewed about 2.5 million times.

PageRank sculpting

People think about PageRank in lots of different ways. People have compared PageRank to a “random surfer” model in which PageRank is the probability that a random surfer clicking on links lands on a page. Other people think of the web as a link matrix in which the value at position (i,j) indicates the presence of links from page i to page j. In that case, PageRank corresponds to the principal eigenvector of that normalized link matrix.
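To make the eigenvector view concrete, here is a minimal Python sketch of power iteration over a row-normalized link matrix. It is an illustration of the textbook idea only, not Google's implementation: the three-page web, the function names, and the assumption that every page has at least one outlink are all made up for the example, and the decay factor is deliberately left out.

```python
def normalize_rows(link_matrix):
    """Divide each row by its out-degree so every row sums to 1."""
    normalized = []
    for row in link_matrix:
        out_degree = sum(row)
        if out_degree == 0:
            # Assumption of this sketch: no dangling pages.
            raise ValueError("this sketch assumes every page has an outlink")
        normalized.append([value / out_degree for value in row])
    return normalized


def principal_eigenvector(link_matrix, iterations=100):
    """Power iteration: repeatedly push rank along the normalized links."""
    n = len(link_matrix)
    transition = normalize_rows(link_matrix)
    rank = [1.0 / n] * n  # start the surfer on a uniformly random page
    for _ in range(iterations):
        # Page j collects rank[i] * transition[i][j] from every page i.
        rank = [
            sum(rank[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
    return rank


# Hypothetical three-page web: page 0 links to 1 and 2, page 1 to 2, page 2 to 0.
EXAMPLE_LINKS = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]

if __name__ == "__main__":
    print(principal_eigenvector(EXAMPLE_LINKS))
```

For this toy graph the values settle near 0.4, 0.2 and 0.4, which is also the long-run share of time a random surfer would spend on each page.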

Disclaimer: Even when I joined the company in 2000, Google was doing more sophisticated link computation than you would observe from the classic PageRank papers. If you believe that Google stopped innovating in link analysis, that’s a flawed assumption. Although we still refer to it as PageRank, Google’s ability to compute reputation based on links has advanced considerably over the years. I’ll do the rest of my blog post in the framework of “classic PageRank” but bear in mind that it’s not a perfect analogy.

Probably the most popular way to envision PageRank is as a flow that happens between documents across outlinks. In a recent talk at WordCamp I showed an image from one of the original PageRank papers:

Flow of PageRank

In the image above, the lower-left document has “nine points of PageRank” and three outgoing links. The resulting PageRank flow along each outgoing link is consequently nine divided by three = three points of PageRank.

That simplistic model doesn’t work perfectly, however. Imagine if there were a loop:

A closed loop of PageRank flow

No PageRank would ever escape from the loop, and as incoming PageRank continued to flow into the loop, eventually the PageRank in that loop would reach infinity. Infinite PageRank isn’t that helpful :) so Larry and Sergey introduced a decay factor–you could think of it as 10-15% of the PageRank on any given page disappearing before the PageRank flows along the outlinks. In the random surfer model, that decay factor is as if the random surfer got bored and decided to head for a completely different page. You can do some neat things with that reset vector, such as personalization, but that’s outside the scope of our discussion.
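The effect of the decay factor on a closed loop can be sketched in a few lines of Python. This is a toy model under stated assumptions: a hypothetical two-page loop (A links to B, B links to A), one point of PageRank arriving at page A from outside on every step, and a function name invented for the example.

```python
def loop_total(decay, steps, inflow=1.0):
    """Total PageRank held by a two-page loop after `steps` rounds of flow."""
    page_a = page_b = 0.0
    keep = 1.0 - decay  # fraction that survives to flow along the outlink
    for _ in range(steps):
        # A receives the outside inflow plus what B passes on; B receives
        # what A passes on. Both right-hand sides use the old values.
        page_a, page_b = inflow + keep * page_b, keep * page_a
    return page_a + page_b


if __name__ == "__main__":
    print(loop_total(decay=0.0, steps=200))   # grows with the step count
    print(loop_total(decay=0.15, steps=200))  # levels off
```

With no decay, the loop's total grows by one point per step and never stops. With a 15% decay, the total levels off near inflow / decay, about 6.67 points in this toy setup.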

Now let’s talk about the rel=nofollow attribute. Nofollow is a method (introduced in 2005 and supported by multiple search engines) to annotate a link to tell search engines “I can’t or don’t want to vouch for this link.” In Google, nofollow links don’t pass PageRank and don’t pass anchortext [*].

So what happens when you have a page with “ten PageRank points” and ten outgoing links, and five of those links are nofollowed? Let’s leave aside the decay factor to focus on the core part of the question. Originally, the five links without nofollow would have flowed two points of PageRank each (in essence, the nofollowed links didn’t count toward the denominator when dividing PageRank by the outdegree of the page). More than a year ago, Google changed how the PageRank flows so that the five links without nofollow would flow one point of PageRank each.
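The arithmetic in that example can be written out as a small Python sketch. It ignores the decay factor, just like the paragraph above, and the function and parameter names are hypothetical; it only illustrates which links count toward the denominator in each case.

```python
def points_per_followed_link(page_points, total_links, nofollowed_links,
                             count_nofollowed):
    """PageRank flowing along each followed link, ignoring the decay factor.

    count_nofollowed=False models the original behavior (nofollowed links
    are left out of the denominator); True models the changed behavior.
    """
    followed = total_links - nofollowed_links
    if followed <= 0:
        return 0.0  # nothing flows if no link is followed
    denominator = total_links if count_nofollowed else followed
    return page_points / denominator


if __name__ == "__main__":
    # Ten points, ten links, five of them nofollowed.
    print(points_per_followed_link(10, 10, 5, count_nofollowed=False))  # 2.0
    print(points_per_followed_link(10, 10, 5, count_nofollowed=True))   # 1.0
```

In both cases the nofollowed links pass nothing. The only difference is whether they still take up a share of the page's points.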

Q: Why did Google change how it counts these links?
A: For one thing, some crawl/indexing/quality folks noticed some sites that attempted to change how PageRank flowed within their sites, but those sites ended up excluding sections of their site that had high-quality information (e.g. user forums).

Q: Does this mean “PageRank sculpting” (trying to change how PageRank flows within your site using e.g. nofollow) is a bad idea?
A: I wouldn’t recommend it, because it isn’t the most effective way to utilize your PageRank. In general, I would let PageRank flow freely within your site. The notion of “PageRank sculpting” has always been a second- or third-order recommendation for us. The first-order things I would recommend paying attention to are 1) making great content that will attract links in the first place, and 2) choosing a site architecture that makes your site usable/crawlable for humans and search engines alike.

For example, it makes a much bigger difference to make sure that people (and bots) can reach the pages on your site by clicking links than it ever did to sculpt PageRank. If you run an e-commerce site, another example of good site architecture would be putting products front-and-center on your web site vs. burying them deep within your site so that visitors and search engines have to click on many links to get to your products.

There may be a minuscule number of pages (such as links to a shopping cart or to a login page) that I might add nofollow on, just because those pages are different for every user and they aren’t that helpful to show up in search engines. But in general, I wouldn’t recommend PageRank sculpting.

Q: Why tell us now?
A: For a couple reasons. At first, we figured that site owners or people running tests would notice, but they didn’t. In retrospect, we’ve changed other, larger aspects of how we look at links and people didn’t notice that either, so perhaps that shouldn’t have been such a surprise. So we started to provide other guidance that PageRank sculpting isn’t the best use of time. When we added a help page to our documentation about nofollow, we said “a solid information architecture – intuitive navigation, user- and search-engine-friendly URLs, and so on – is likely to be a far more productive use of resources than focusing on crawl prioritization via nofollowed links.” In a recent webmaster video, I said “a better, more effective form of PageRank sculpting is choosing (for example) which things to link to from your home page.” At Google I/O, during a site review session I said it even more explicitly: “My short answer is no. In general, whenever you’re linking around within your site: don’t use nofollow. Just go ahead and link to whatever stuff.” But at SMX Advanced 2009, someone asked the question directly and it seemed like a good opportunity to clarify this point. Again, it’s not something that most site owners need to know or worry about, but I wanted to let the power-SEOs know.

Q: If I run a blog and add the nofollow attribute to links left by my commenters, doesn’t that mean less PageRank flows within my site?
A: If you think about it, that’s the way that PageRank worked even before the nofollow attribute.

Q: Okay, but doesn’t this encourage me to link out less? Should I turn off comments on my blog?
A: I wouldn’t recommend closing comments in an attempt to “hoard” your PageRank. In the same way that Google trusts sites less when they link to spammy sites or bad neighborhoods, parts of our system encourage links to good sites.

Q: If Google changed its algorithms for counting outlinks from a page once, could it change again? I really like the idea of sculpting my internal PageRank.
A: While we can’t ever say that things will never change in our algorithms, we do not expect this to change again. If it does, I’ll try to let you know.

Q: How do you use nofollow on your own internal links on your personal website?
A: I pretty much let PageRank flow freely throughout my site, and I’d recommend that you do the same. I don’t add nofollow on my category or my archive pages. The only place I deliberately add a nofollow is on the link to my feed, because it’s not super-helpful to have RSS/Atom feeds in web search results. Even that’s not strictly necessary, because Google and other search engines do a good job of distinguishing feeds from regular web pages.

[*] Nofollow links definitely don’t pass PageRank. Over the years, I’ve seen a few corner cases where a nofollow link did pass anchortext, normally due to bugs in indexing that we then fixed. The essential thing you need to know is that nofollow links don’t help sites rank higher in Google’s search results.

Add Custom Search to any site in two minutes

By the way, you might have missed it at Google I/O, but the Custom Search Engine team has made it really easy to add custom search to any site. Google recently introduced Web Elements, which are simple snippets of code you can copy/paste into your site’s HTML.

From the Custom Search Element web page, I copied the code. Then in the WordPress control panel under Appearance->Widgets, I clicked to add a new “Text” widget, changed the title to “Search (CSE)” and pasted the code into the box:

Add Custom Search

Click “Done” and then “Save Changes” and that’s it! No need to register, sign up for anything, get a user ID, or anything like that. You can see the result on the right-hand sidebar of my blog.

Android barcode scanner in 6 lines of Python code

After my last video about using a barcode scanner to add and search books in your library, I was feeling pretty happy. Bar code scanners are pretty cheap–mine cost about $65. But then Google released the Android Scripting Environment (ASE) and it turns out that you don’t even need a bar code scanner. Instead, you can use an Android phone such as the G1.

Just as a proof-of-concept, here’s a barcode scanner written in six lines of Python code:

import android
droid = android.Android()
code = droid.scanBarcode()
isbn = int(code['result']['SCAN_RESULT'])
url = "" % isbn
droid.startActivity('android.intent.action.VIEW', url)

Thanks to fellow Googler Vijayakrishna Griddaluru for sending me this sample code. Visiting the resulting url offers the option to add that book to your library:

Android bar code scanner

Pretty easy, huh? You can read all about the new scripting environment. Not only can you scan bar codes, you can use text-to-speech, make phone calls, send text messages, read sensor data, and find your location–all from easy scripts. One person wrote a script to go into silent mode when the phone is placed screen-down on the table. It took less than 20 lines of code, and that’s including comments!
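If you want to try the URL-building step of the six-line script without a phone, a rough sketch like the following runs in plain Python. It assumes the scan result is a nested dictionary shaped like the one the script indexes into; the sample barcode value, the example.com URL template, and the function name are placeholders invented for illustration, not the values used in the script.

```python
def lookup_url(scan, template):
    """Build a lookup URL from a scan result shaped like the script's.

    `template` must contain a single %d placeholder for the ISBN.
    int() raises ValueError if the scanned text is not all digits.
    """
    isbn = int(scan['result']['SCAN_RESULT'])
    return template % isbn


# Hypothetical values for illustration only.
SAMPLE_SCAN = {'result': {'SCAN_RESULT': '9780306406157'}}
SAMPLE_TEMPLATE = 'http://example.com/books?isbn=%d'

if __name__ == "__main__":
    print(lookup_url(SAMPLE_SCAN, SAMPLE_TEMPLATE))
```

Keeping this step in its own function means the only phone-specific parts left are the scan itself and the call that opens the URL.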

The Android Scripting Environment should make fun projects even easier. Brad Fitzpatrick wrote about using his Android phone to open his garage door automatically when his motorcycle gets close to home. Now those sorts of projects are even easier to write. :)

Search your bookshelf with a $65 barcode scanner

(Okay, if TechCrunch wrote about my video then I should probably at least do a blog post too.)

Last year I suggested potential Summer of Code projects and one of my favorite suggestions was “How about a good open-source program to manage your book library? Something like the Delicious Library program, but that works with Linux?” In the blog comments, Colin Colehour left an excellent comment: “Matt, Can’t you use Google Books to keep track of your book library at home? You can add books that you own to the ‘my library’ list and then export that as an xml file and they have RSS feeds.”

The suggestion was so obvious that I smacked my head. Why install software at all when a website will store the data for you? The only problem was how to tell Google which books I own. Well, there’s a neat hack for this too: Amazon carries the Adesso NuScan 1000 bar code scanner for $65.44 with free shipping. I’m sure you can get barcode scanners for cheaper (anyone remember the CueCat scanner that was free?), but the Adesso had good reviews.

With that, adding your books to Google’s My Library feature is simplicity itself–the Google Books team has tweaked the workflow so that you can barcode scan and add lots of books very quickly. Here’s the video to demonstrate:

Why would you record which books you own in the first place? The immediate reason is that you can run full-text searches against the books in your library. That’s right: just by scanning bar codes, you can search over the text of books you own. Down the road, I can easily imagine other uses. Wouldn’t it be great if you could upload your list of books to Amazon, and it would automatically suggest other books you should read? Or avoid suggesting books that you already own? Josh Lowensohn mentions another great reason to do this: it creates a record for insurance purposes.

Once you have your book list, there are social networks for book lovers such as Goodreads and LibraryThing. And please note: this isn’t the only way to scan your books. Delicious Library 2 is $40 commercial software for the Mac that can use your Mac’s built-in webcam.

Special thanks to Michael ‘Wysz’ Wyszomierski for recording and producing this video. I love that he showed the computer’s screen and showed an “action shot” of scanning the books.