Archives for May 2006

Vacation books?

Okay, I’m looking for fun, light reading for my vacation. I don’t want search stuff, I don’t want heavy reading, I don’t want geopolitics or history.

Things like The Curious Incident of the Dog in the Night-Time. Or Terry Pratchett. Or early William Gibson. Cheesy cyberpunk if they don’t get the computer stuff too wrong. Neil Gaiman. Transmetropolitan.

Lazyweb, I invoke you! What should I read on vacation?

Make a favicon

Okay, I recently made the cheesiest favicon evah. (A favicon is that little icon beside a url in the address bar of your browser.) There’s a great explanation out there of how to make your own favicon.

The brief version on Windows XP is:
– Download Paint.NET.
– Resize your blank image to 16×16.
– Change your window size to 3200% (it’s a drop-down).
– Toggle grid mode on (right next to the window size drop-down).
– Make a dinky image with 16 or so colors.
– Save as favicon.png (in PNG format). It might look like this: [image: my silly little ico in PNG format]
– Use png2ico (warning: zip file) with syntax like “png2ico favicon.ico --colors 16 favicon.png” to make your favicon.ico file. Note that’s two dashes before “colors,” not one long dash.
– Upload the favicon.ico file to the root level of your website.

Not so hard, except for the part requiring artistic talent. That’s why all my doodles are geometric. 🙂
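If you’d rather script the conversion, here’s a minimal sketch using the Pillow imaging library (my own choice of tool, not part of the steps above; it writes the .ico directly, so png2ico isn’t needed):

    # A minimal sketch using Pillow. Assumes you already drew a
    # favicon.png by hand, as in the steps above.
    from PIL import Image

    img = Image.open("favicon.png")
    img = img.resize((16, 16))                                  # make sure it's 16x16
    img = img.convert("P", palette=Image.ADAPTIVE, colors=16)   # roughly 16 colors
    img.save("favicon.ico", format="ICO", sizes=[(16, 16)])     # write the icon

Then upload the resulting favicon.ico just as in the last step.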

Indexing timeline

Heh. I wrote this hugely long post, so I pulled a Googler aside and asked “Dan, what do you think of this post?” And after a few helpful comments he said something like, “And, um, you may want to include a paragraph of understandable English at the top.” 🙂

Fair enough. Some people don’t want to read the whole mind-numbingly long post while their eyes glaze over. For those people, my short summary would be two-fold. First, I believe the crawl/index team certainly has enough machines to do its job, and we definitely aren’t dropping documents because we’re “out of space.” The second point is that we continue to listen to webmaster feedback to improve our search. We’ve addressed the issues that we’ve seen, but we continue to read through the feedback to look for other ways that we could improve.

People have been asking for more details on “pages dropping from the index” so I thought I’d write down a brain dump of everything I knew about, to have it all in one place. Bear in mind that this is my best recollection, so I’m not claiming that it’s perfect.

Bigdaddy: Done by March

– In December, the crawl/index team was ready to debut Bigdaddy, which was a software upgrade of our crawling and parts of our indexing.
– In early January, I hunkered down and wrote tutorials about url canonicalization, interpreting the inurl: operator, and 302 redirects. Then I told people about a data center where Bigdaddy was live and asked for feedback.
– February was pretty quiet as Bigdaddy rolled out to more data centers.
– In March, some people on WebmasterWorld started complaining that they saw none of their pages indexed in Bigdaddy data centers and that they were more likely to see supplemental results.
– On March 13th, GoogleGuy gave a way for WMW folks to give example sites.
– After looking at the example sites, I could tell what the issue was in a few minutes. The sites that fit the “no pages in Bigdaddy” criteria were sites where our algorithms had very low trust in the inlinks or the outlinks of that site. Examples that might cause that include excessive reciprocal links, linking to spammy neighborhoods on the web, or link buying/selling. The Bigdaddy update is independent of our supplemental results, so when Bigdaddy didn’t select pages from a site, that would expose more supplemental results for that site.
– I worked with the crawl/index team to tune thresholds so that we would crawl more pages from those sorts of sites.
– By March 22nd, I posted an update to let people know that we were crawling more pages from those sorts of sites. Over time, we continued to boost the indexing even more for those sites.
– By March 29th, Bigdaddy was fully deployed and the old system was turned off. Bigdaddy has powered our crawling ever since.

Considering the amount of code that changed, I consider Bigdaddy pretty successful in that I only saw two complaints. The first was the one I mentioned above, where we didn’t index pages from sites with less-trusted links; we responded and started indexing more pages from those sites pretty quickly. The other complaint I heard was that pages crawled by AdSense started showing up in our web index. The fact that Bigdaddy provided a crawl caching proxy was a deliberate improvement in crawling, and I was happy to describe it in PowerPoint-y detail on the blog and at WMW Boston.

Okay, that’s Bigdaddy. It’s more comprehensive, and it’s been visible since December and 100% live since March. So why the recent hubbub? Well, now that Bigdaddy is done, we’ve turned our focus to refreshing our supplemental results. I’ll give my best recollection of that timeline too. Around the same time, there was speculation that our machines are full. From my personal perspective in the quality group, we certainly have enough machines to crawl/index/serve web results; in fact, Bigdaddy is more comprehensive than our previous system. Seems like a good time to throw in a link to my disclaimer right here to remind people that this is my personal take.

Refreshing supplemental results

Okay, moving right along. As I mentioned before, once Bigdaddy was fully deployed, we started working on refreshing our supplemental results. Here’s my timeline:
– In early April, we started showing some refreshed supplemental results to users.
– On April 13th, someone started a thread on WMW to ask about having fewer pages indexed.
– On April 24th, GoogleGuy gave a way for people to provide specifics (WebmasterWorld, like many webmaster forums, doesn’t allow people to post specific site names.)
– I looked through the feedback and didn’t see any major trends. Over the next week, I gave examples to the crawl/index team. They didn’t see any major trend either. The sitemaps team investigated until they were satisfied that it had nothing to do with sitemaps either.
– The team refreshing our supplemental results checked out feedback, and on May 5th they discovered that a “site:” query didn’t return supplemental results. I think that they had a fix out for that the same day. Later, they noticed that a difference in the parser meant that site: queries didn’t work with hyphenated domains. I believe they got a quick fix out soon afterwards, with a full fix for site: queries on hyphenated domains in supplemental results expected this week.
– GoogleGuy stopped back by WMW on May 8th to give more info about site: and get any more info that people wanted to provide.

Reading current feedback

Those are the issues that I’ve heard of with supplemental results, and those have been resolved. Now, what about folks who are still asking about fewer pages being reported from their site? As if this post isn’t long enough already, I’ll run through some of the emails and give potential reasons that I’ve seen:

– The first site is a .tv domain about real estate in a foreign country. On May 3rd, the site owner said that they have about 20K properties listed but that they had dropped to 300 pages indexed. When I checked, a site: query shows 31,200 pages indexed now, and the example url they mentioned is in the index. I’m going to assume this domain is doing fine now.

– Okay, let’s check one from May 11th. The owner sent only a url, with no text or explanation at all, but let’s tackle it. This is also a real estate site, this time about an Eastern European country. I see 387 pages indexed currently. Aha, checking out the bottom of the page, I see this:
[screenshot: poor-quality links in the page footer]
Linking to a free ringtones site, an SEO contest, and an Omega 3 fish oil site? I think I’ve found your problem. I’d think about the quality of your links if you’d prefer to have more pages crawled. As these indexing changes have rolled out, we’ve been improving how we handle reciprocal link exchanges and link buying/selling.

– Moving right along, here’s one from May 4th. It’s another real estate site. The owner says that they used to have 10K pages indexed and now they have 80. I checked out the site. Aha:
[screenshot: more poor-quality links in the page footer]
This time, I’m seeing links to mortgages sites, credit card sites, and exercise equipment. I think this is covered by the same guidance as above; if you were getting crawled more before and you’re trading a bunch of reciprocal links, don’t be surprised if the new crawler has different crawl priorities and doesn’t crawl as much.

– Someone sent in a health care directory domain. It seems like a fine site, and it’s not linking to anything junky. But it only has six links to the entire domain. With that few links, I can believe that out toward the edge of the crawl, we would index fewer pages. Hold on, digging deeper. Aha, the owner said that they wanted to kill the www version of their pages, so they used the url removal tool on their own site. I’m seeing that you removed 16 of your most important directories from Oct. 10, 2005 to April 8, 2006. I covered this topic in January 2006:

Q: If I want to get rid of the www (or non-www) version of my domain but keep the other one, should I use the url removal tool to remove the unwanted hostname?
A: No, definitely don’t do this. If you remove one of the www vs. non-www hostnames, it can end up removing your whole domain for six months. Definitely don’t do this. If you did use the url removal tool to remove your entire domain when you actually only wanted to remove the www or non-www version of your domain, do a reinclusion request and mention that you removed your entire domain by accident using the url removal tool and that you’d like it reincluded.
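(Side note: the safe way to consolidate the www and non-www versions is a permanent 301 redirect from one hostname to the other, not the removal tool. Here’s a quick sketch in Python for checking how each variant of a domain responds; this is my own illustration, with “” as a stand-in for a real domain:)

    # Check how the www and non-www hostnames of a domain respond.
    # Ideally one variant serves pages (200) and the other 301-redirects to it.
    import http.client

    def first_response(host):
        """HEAD the homepage without following redirects; return status and Location."""
        conn = http.client.HTTPConnection(host, timeout=10)
        conn.request("HEAD", "/")
        resp = conn.getresponse()
        status, location = resp.status, resp.getheader("Location")
        conn.close()
        return status, location

    for host in ("", ""):  # placeholder domain
        print(host, first_response(host))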

You didn’t remove your entire domain, but you removed all the important subdirectories. That self-removal just lapsed a few weeks ago. That said, your site also has very few links pointing to you. A few more relevant links would help us know to crawl more pages from your site. Okay, let’s read another.

– Somebody wrote about a “favorites” site that sells T-shirts. The site had about 100 pages, and now Google is showing about five pages. Looking at the site, the first problem that I see is that only 1-2 domains have any links at all to you. The person said that every page has original content, but every link that I clicked was an affiliate link that went to the site that actually sold the T-shirts. And the snippet of text that I happened to grab was also taken from the site that actually sold the T-shirts. The site has a blog, which I’d normally recommend as a good way to get links, but every link on the blog is just an affiliate link. The first several posts didn’t even have any text, and when I found an entry that did, it was copied from somewhere else. So I don’t think that the drop in indexed pages for this domain necessarily points to an issue on Google’s side. The question I’d be asking is why anyone would choose your “favorites” site instead of going directly to the site that actually sells the T-shirts.

Closing thoughts

Okay, I’ve got to wrap up (longest. post. evar). But I wanted to give people a feel for the sort of feedback that we’ve been getting over the last few days. In general, several domains I’ve checked have more pages reported these days (and overall, Bigdaddy is more comprehensive than our previous index). Some folks that were doing a lot of reciprocal links might see less crawling. If your site has very few links, putting you out toward the fringe of the crawl, then it’s relatively normal that changes in the crawl may change how much of your site we crawl. And if you’ve got an affiliate site, it makes sense to think about the amount of value-add that your site provides; you want to provide a reason why users would prefer your site.

In March, I was able to read feedback and identify an issue to fix in 4-5 minutes. With the most recent feedback, we did find a couple of ways that we could make site: more accurate, but despite having several teams (quality, crawl/index, sitemaps) read the remaining feedback, we’re seeing more of a grab-bag of feedback than any burning issues. Just to be clear, I’m not saying that we won’t find other ways to improve. Adam has been reading and replying to the emails and collecting domains to dig into, for example. But I wanted to give folks an update on what we were seeing with the most recent feedback.

Can I just say..

Dude, Walt Mossberg is the man.

Review: Google Co-op

I’ve had some time to mull over what I saw at Google Press Day. Let’s review the products that were made available:

Google Co-op. See below for the review.
Google Trends. Pretty much everyone can see the appeal of this in a few seconds, and it’s fun to play with. For example, Marc Grobman noticed that people are more interested in “breakfast” on Saturday/Sunday, but “brunch” is more of a Sunday thing.
Google Desktop 4. I updated my ancient, crufty version of Google Desktop last night, and it’s nice. My favorite is the “ctrl-ctrl” shortcut to bring up a search box. I wanted to make it so that “shift-shift” toggles the sidebar and gadgets on and off, but couldn’t get that to work. The newest version is clearly more polished than my older version though.

Of the products, Google Co-op is one that I’m getting more excited about over time. A few webmasters had an initial reaction of “My head just exploded!” But several folks are digging into this more now. Barry and Detlev had an interesting conversation in the May 12th SearchCast about Barry’s implementation of a subscribe link. Barry had some really good comments about how this is self-regulating for spam. If people have to subscribe, it’s less likely to attract spam attempts. And if you overreach and try to stick useless labels on every search under the sun, it’s very easy for a subscriber to click “remove” to get rid of your labels.

Should Google Co-op make your head explode? I don’t think so. Personally, I believe that the core concept is pretty smart: it gives experts a way to augment and annotate Google’s search results. The simplest level of interaction with Google Co-op is to annotate pages, whether your field of expertise is TiVos or Kingdom of Loathing. Most people are an expert on something. 🙂

I was talking to Ramanathan Guha about Google Co-op the other day, and I realized that Co-op goes much deeper than that, though. For example, I talked to Marti Hearst last week at a dinner, and I was reminded of her group’s research about different ways to slice and dice data. One example is that a recipe could be categorized by ingredients, type of dish (appetizer vs. dessert), or occasion, so you can reach the same recipe along many different paths; a Nobel Prize winner page organized the same way is another example. In Co-op, these ways to slice the data are called facets, so users aren’t constrained to a strict hierarchy to find information.
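To make facets concrete, here’s a toy sketch in Python (the recipes and attributes are invented for illustration; this isn’t Co-op’s format):

    # A toy illustration of facets: each recipe carries several independent
    # attributes, and a user can slice the collection along any of them
    # instead of walking one fixed hierarchy.
    recipes = [
        {"name": "Gazpacho",   "ingredient": "tomato", "dish": "appetizer", "occasion": "summer"},
        {"name": "Tiramisu",   "ingredient": "coffee", "dish": "dessert",   "occasion": "dinner party"},
        {"name": "Bruschetta", "ingredient": "tomato", "dish": "appetizer", "occasion": "dinner party"},
    ]

    def slice_by(facet, value):
        """Return the names of recipes matching one facet value."""
        return [r["name"] for r in recipes if r[facet] == value]

    print(slice_by("dish", "appetizer"))     # ['Gazpacho', 'Bruschetta']
    print(slice_by("ingredient", "tomato"))  # ['Gazpacho', 'Bruschetta']
    print(slice_by("occasion", "summer"))    # ['Gazpacho']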

I’m still poking into Co-op, but I’m excited about the ability to merge different people’s views that share the same label. I also noticed this intriguing entry in the Topics Developer Guide:

ELIMINATE: Results with this label will not appear among the results.

I checked with Guha and the functionality sounds cool: if I had a “webspam” label with the ELIMINATE mode, then I could share a blacklist. Anyone could subscribe to me to share that blacklist. And then anyone could make their own file and just use the same label with the same semantics. Eventually, you might be able to subscribe to 9-10 people who kept a list of domains that they wanted to remove from Google’s results. Shame on me for going straight to webspam when most people will want to augment and annotate Google search rather than remove results, but it’s neat that Co-op has this functionality built in. 🙂
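Just to illustrate those ELIMINATE semantics, here’s my own sketch of the idea in Python (not Co-op’s actual file format or API; the labeled domains are hypothetical):

    # ELIMINATE-style filtering: results whose domain appears on any
    # subscribed blacklist simply never show up in the output.
    subscribed_blacklists = {
        "mattcutts": {"", ""},  # hypothetical labeled domains
        "friend":    {""},
    }

    def eliminate(results):
        """Drop results whose domain is on a subscribed blacklist."""
        blocked = set().union(*subscribed_blacklists.values())
        return [url for url in results if url.split("/")[2] not in blocked]

    urls = ["", "", ""]
    print(eliminate(urls))  # -> ['', '']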

I’m still digging into Co-op, but I’m liking what I see so far. It gives a richer vocabulary for talking about things than just links. So a site such as Quackwatch could produce some really interesting ways of labeling sites. Subscribing is easy and unsubscribing is lightweight, so there’s not much hassle in subscribing to quite a few people. I’m looking forward to digging into Co-op more over time. 🙂