Archives for April 2006

Simplifying Apache configuration?

I love the Apache web server. It’s blazingly fast, it’s ubiquitous, and it can do a ton of neat stuff. Most of the world’s webservers run on Apache. The only thing that I don’t like about Apache is some of the configuration and set-up. If you ask 10 webmasters what the trickiest technical thing to do is, about 5-6 of them will say things like “configuring a web server to do redirects, mod_rewrite, and setting up .htaccess.” For example, WMW has a guide to changing dynamic URLs to static URLs with mod_rewrite, but it’s still pretty complicated. Notice intelligent people debating the finer points of regular expressions in places like here, here, here, and here.
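To give a flavor of what webmasters wrestle with, here is the kind of mod_rewrite rule such a guide walks you through. This is a sketch only; the paths and parameter name are made up for illustration:

```apache
# Sketch: present a static-looking URL while serving a dynamic script.
# /articles/123.html is rewritten internally to /article.php?id=123
RewriteEngine On
RewriteRule ^articles/([0-9]+)\.html$ /article.php?id=$1 [L]
```

Simple enough once you see it, but getting the regular expression, the flags, and the .htaccess context right on your own is exactly the part people trip over.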

A few of us were talking about this at Google. Should we include an .htaccess tool in Sitemaps so that you don’t need a UNIX command-line to generate password hashes? Maybe a tool to take a list of desired redirects or rewrites, then output the correct syntax that you could cut and paste into a web server config file?
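In the meantime, it doesn’t take much code to generate a password hash yourself. Here’s a small Python sketch (the function name is mine; note that Apache’s {SHA} scheme works but is weak by modern standards, so htpasswd’s MD5 or bcrypt options are a better choice for anything serious):

```python
import base64
import hashlib

def htpasswd_sha_line(user, password):
    """Build an htpasswd entry using Apache's {SHA} scheme:
    the username, a colon, '{SHA}', then the base64 of the
    raw SHA-1 digest of the password."""
    digest = hashlib.sha1(password.encode("utf-8")).digest()
    return f"{user}:{{SHA}}{base64.b64encode(digest).decode('ascii')}"

# Append the resulting line to your .htpasswd file.
print(htpasswd_sha_line("alice", "s3cret"))
```

No UNIX command line required, which is the point.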

Then I remembered: Summer of Code 2006 just started taking project proposals! Not only is the Apache Software Foundation a mentoring organization, but they even have a wiki for project suggestions.

So if you’re a student and want to earn $4500 for hacking on some code this summer (and beef up your resume with Open Source experience *and* get more familiar with Apache), why not try making it easier to configure Apache? A light-weight project might be a program that takes easy input and outputs the correct configuration code that can be cut-and-pasted into Apache config files. A project for a more skilled person might be directly changing Apache to allow simpler configuration. If you do propose a project to simplify Apache config files, let me know. 🙂

(Note: I didn’t actually discuss this with the Apache folks; it’s just an area where I see webmasters make mistakes. If you’re doing Summer of Code, you’ll want to chat with the Apache folks first, because they might have other things that they need more.)

I hate media

(No, I don’t hate the Media with a capital M. I hate lower-case media. Things like tapes, CDs, and DVD-R/+R/RW/RAM/RW-/RW+.)

Our cat Frank died around Thanksgiving of last year of FIP, leaving his sister Emmy a little lonely. A few weeks ago we got another cat to keep Emmy company. This new kitten is only around 10 months old, and he’ll run around and chase anything. His air time and somersaults can be quite impressive, so I wanted to get a new video camera to record some of his antics before he gets older and slower.

After spending a day checking out video cameras, I’m dismayed at how many cameras shoot video to a format that will be hard to access in a few years. Why do video cameras still use tape? It reminds me of mainframes with those big round plastic circles whirring around. You know, like this:

Ugly, nasty, skanky old tape

Geez, it’s 2006 for pete’s sake–why should I be pulling tapes in and out of what is (for the most part) a computer? For a long time, I burned data onto CDs or DVDs or tapes. If you do that, you’re always looking for a pen or a label, or trying to find the right CD. Eventually I saw the light. The light is: digital storage is the only way to go. Hard drives or (worst case) memory cards are so much easier. And they’re cheap. I saw a 200G hard drive for $82 earlier today.

So this afternoon I ducked over to Best Buy and picked up a GZ-MG37US. It’s got a 30GB internal hard drive. I’ll let you know how it works. The Sony HDR-HC3 looked really sharp (it records at HD resolution instead of standard), but can you guess the output format: tape. I have a VHS tape I’ve been meaning to convert to DVD or MPEG for six years. Heck, I’ve got a D1 tape from grad school days and I have *no idea* how I’m going to convert that to something sane. So I’m voting with my checkbook: no more tape for me.

Crawl caching proxy

Several people have noticed content from other Google bots showing up in our main web index, and are wondering… why/how does that happen? Last week I was at WebmasterWorld Boston and I talked about this issue there, but I’d like to do a blog post about Google’s crawl caching proxy, because some people have questions about it.

First off, let me mention what a caching proxy is just to make sure that everyone’s aware. I’ll use an example from a different context: Internet Service Providers (ISPs) and users. When you surf around the web, you fetch pages via your ISP. Some ISPs cache web pages and can then serve those pages to other users visiting the same pages. For example, if user A requests a page, an ISP can deliver that page to user A and cache it. If user B requests the same page a second later, the ISP can return the cached copy. Lots of ISPs and companies do this to save bandwidth. Squid, for example, is a free and widely used web proxy cache that a lot of people have heard of.

As part of the Bigdaddy infrastructure switchover, Google has been working on frameworks for smarter crawling, improved canonicalization, and better indexing. On the smarter crawling front, one of the things we’ve been working on is bandwidth reduction. For example, the pre-Bigdaddy webcrawl Googlebot with user-agent “Googlebot/2.1 (+” would sometimes allow gzipped encoding. The newer Bigdaddy Googlebots with user-agent “Mozilla/5.0 (compatible; Googlebot/2.1; +” are much more likely to support gzip encoding. That reduces Googlebot’s bandwidth usage for site owners and webmasters. From my conversations with the crawl/index team, it sounds like there’s a lot of headroom for webmasters to reduce their bandwidth by turning on gzip encoding.
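If you run Apache 2.x and want to turn on gzip encoding, mod_deflate is the usual route. A minimal sketch, assuming mod_deflate is already loaded (tune the MIME types for your own site):

```apache
# Compress common text responses for clients that advertise
# gzip support via the Accept-Encoding header.
AddOutputFilterByType DEFLATE text/html text/plain text/css application/x-javascript
```

Apache only sends compressed responses to clients that ask for them, so older bots and browsers still get plain content.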

Another way that Bigdaddy saves bandwidth for webmasters is by using a crawl caching proxy. I maxed out my PowerPoint skills to produce an illustration. 🙂 As a hypothetical example, imagine that you participate in AdSense, that Google fetches URLs from your site for our blog search, and that Google also crawls your pages for its main web index. A typical day might look like this:

Page fetches under the old crawl

In this diagram, Service A could be AdSense and Service N could be blogsearch. As you can see, the site got 11 page fetches from the main indexing Googlebot, 8 fetches from the AdSense bot, and 4 fetches from blogsearch, for a total of 23 page fetches. Now let’s look at how a crawl cache can save bandwidth:

A crawl cache is much smarter!

In this example, if the blogsearch crawl or AdSense wants to fetch a page that the web crawl already fetched, it can get it from the crawl caching proxy instead of generating more page fetches. That could reduce the number of pages fetched down to as little as 11. In the same way, a page that was fetched for AdSense could be cached and then returned to the web crawl if it requested the same page.

So the crawl caching proxy works like this: if service X fetches a page, and then later service Y would have fetched the exact same page, Google will sometimes use the page from the caching proxy. Joining service X (AdSense, blogsearch, News crawl, any Google service that uses a bot) doesn’t queue up pages to be included in our main web index. Also, note that robots.txt rules still apply to each crawl service appropriately. If service X was allowed to fetch a page, but a robots.txt file prevents service Y from fetching the page, service Y wouldn’t get the page from the caching proxy. Finally, note that the crawl caching proxy is not the same thing as the cached page that you see when clicking on the “Cached” link by web results. Those cached pages are only updated when a new page is added to our index. It’s more accurate to think of the crawl caching proxy as a system that sits outside of webcrawl, and which can sometimes return pages without putting extra load on external sites.
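For the curious, the logic above can be sketched in a few lines of Python. This is purely a toy illustration of the idea; the class, its names, and the TTL are my assumptions, not how Google actually implements it:

```python
import time

class CrawlCachingProxy:
    """Toy sketch of a shared fetch cache for multiple crawl services."""

    def __init__(self, fetcher, robots_allowed, ttl=3600):
        self.fetcher = fetcher                # fetcher(url) -> page content
        self.robots_allowed = robots_allowed  # robots_allowed(service, url) -> bool
        self.ttl = ttl                        # seconds a cached page stays fresh
        self._cache = {}                      # url -> (fetch_time, content)
        self.external_fetches = 0             # count of real fetches from the site

    def get(self, service, url):
        # robots.txt still applies per service: a disallowed service never
        # gets the page, even if another service already cached it.
        if not self.robots_allowed(service, url):
            return None
        now = time.time()
        entry = self._cache.get(url)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]                   # cache hit: no load on the site
        content = self.fetcher(url)           # cache miss: one real page fetch
        self.external_fetches += 1
        self._cache[url] = (now, content)
        return content
```

With this structure, the second service to ask for a URL gets the cached copy, so the external site only sees one fetch, which is exactly the bandwidth saving in the diagrams above.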

Just as always, participating in AdSense or being in our blogsearch doesn’t get you any “extra” crawling (or ranking) in our web index whatsoever. You don’t get any extra representation in our index, you don’t get crawled/indexed any faster by our webcrawl, and you don’t get any boost in ranking.

This crawl caching proxy was deployed with Bigdaddy, but it was working so smoothly that I didn’t know it was live. 🙂 That should tell you that this isn’t some sort of webspam cloak-check; the goal here is to reduce crawl bandwidth. Thanks to Greg Boser for noticing this, and thanks to Jensense for noticing that one of our online answers had stale info. The support team has updated that answer.

Guest post: Vanessa Fox on Organic Site Review session

I almost made it. I got through one day of a three day conference, and I was still blogging, caught up on email, and I’d checked my RSS feeds. Then on the second day, I spoke on three panels, stayed up talking SEO until 3:30, and it all crashed down. Now 80-90 emails sit unread in my inbox, and I’m behind on everything else too.

During the conference, I was talking to Vanessa Fox from the Sitemaps team. You probably know her from the Sitemaps blog, and she was also at WMW Boston. It turns out that she took lots of notes at the organic site review panel.

“Would you like to do a guest post on my blog?” I asked. “Sure, why not?” Vanessa replied. That’s cool, because my summary of that panel would have been something like:

The last time I did this panel, SEOs realized how much things like paid links could stick out like a sore thumb. In a different panel at WMW Boston, Rae Hoffman illustrated that other SEOs could easily see paid links using open tools like Yahoo’s Site Explorer, so it doesn’t even require the special tools that a search engine has. On the bright side, every site in the organic site review panel looked white-hat and had serious questions; paid links didn’t come up for discussion once during the panel.

without going into the detail of all we talked about. So I’m glad Vanessa is willing to cover it in more detail. Without further ado, here are Vanessa’s notes on that session:

I’ve been having a great time here at Pubcon Boston, talking to webmasters, getting feedback, and learning about what they’d most like from Google. I sat in on the Organic Site Reviews session, both because Google Sitemaps is a site review tool from a different perspective and because I wanted to make funny faces at Matt while he was talking.

I normally blog for Google Sitemaps, but Matt asked me if I wanted to do a guest post here (probably to keep me too busy paying attention to make funny faces at him).

The strongest point I got from the session (and I knew it already, but it became so apparent) is that you don’t need any special tools or secret knowledge to evaluate your site. Search engines want to return relevant and useful results for searchers. All you really need to do is look at your site through the eyes of your audience. What do they see when they get to your site? Can they easily find what they’re looking for? Webmasters are looking for some other secret key, but really, that’s all there is to it.

The panelists (Matt, Tim Mayer from Yahoo, Thomas Bindl, and Bruce Clay from Bruce Clay, Inc.) looked over the sites that the audience asked about and offered up advice.

Subscription-based content
Googlebot and other search engine bots can only crawl the free portions that non-subscribed users can access. So, make sure that the free section includes meaty content that offers value. If the article is about African elephants and only one paragraph is available in the free section, make sure that paragraph is about African elephants, and not an introductory section that talks about sweeping plains and brilliant sunsets. If it’s the latter, the article is likely to be returned in results for, oh say, [sweeping plains] and [brilliant sunsets] rather than [African elephants].

And compare your free section to the information offered by other sites that are ranking more highly for the keywords you care about. If your one free paragraph doesn’t compare to the content they provide, it only makes sense that those sites could be seen as more useful and relevant.

You could make the entire article available for free to users who access it from external links and then require login for any additional articles. That would enable search engine crawlers to crawl the entire article, which would help users find it more easily. And visitors to your site could read the entire article and see first-hand how useful your articles are, which would make a subscription to your site more compelling.

You could also structure your site in such a way that the value available to subscribers was easily apparent. The free content could provide several useful paragraphs on African elephants. Then, rather than one link that says something like “subscribe to read more”, you could list several links to the specific subcategories available in the article, as well as links to all related articles. You could provide some free and some behind a subscription (and make the distinction between the two obvious).

For instance:

African elephants — topics available once you register:
Habitat ($)
Diet ($)
Social patterns ($)

Related articles:
Asian elephants
Wildlife refuges ($)
History of elephants ($)

Sure, having all those keywords on the page might help the page return in results for those keywords, but it’s not just about search engines. Visitors to your site have a much better idea about what’s available after registration with a linking structure like that than with “subscribe to read more”. Visitors want to know what “more” means. Ultimately, you care about users, not search engines. You just want the search engines to let the users know about your site. You want to make users happy once they do know about it.

Flash-based sites
It’s not only Googlebot that doesn’t watch a 20-second video load before the home page comes into view. A lot of users don’t either. Some users don’t want to wait that long; other users don’t have Flash installed. If all of your content and menus are in Flash, search engines may have a harder time following the links. If you feel strongly about using Flash, just make an HTML version of the page available as well. The search engine bots will thank you. Your users will thank you. Feel free to block the Flash version from the crawlers with a robots.txt file, since you don’t need your pages indexed twice. If your home page is Flash, put the navigation outside of the Flash content. You could offer a choice on the home page so users can choose either the HTML version or the Flash version of the site. (You might be surprised at what users choose.)
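Blocking the Flash copies is a one-stanza robots.txt change. A sketch, assuming the HTML pages live at the site root and the Flash versions sit under a hypothetical /flash/ directory:

```
# Keep crawlers out of the duplicate Flash version of the site
User-agent: *
Disallow: /flash/
```

Crawlers index the HTML pages, users who want Flash still get it, and you avoid having two copies of every page competing in the index.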

Images as text and navigation
What is true for Flash is also true of images. Many users have images turned off. Try viewing your site with images turned off in your browser. Can you still see all the content and links?

Sites in general
You know what your site’s about, so it may seem completely obvious to you when you look at your home page. But ask someone else to take a look and don’t tell them anything about the site. What do they think your site is about?

Consider this text:

“We have hundreds of workshops and classes available. You can choose the workshop that is right for you. Spend an hour or a week in our relaxing facility.”

Will this site show up for searches for [cooking classes] or [wine tasting workshops] or even [classes in Seattle]?

It may not be as obvious to visitors (and search engine bots) what your page is about as you think.

Along those same lines, does your content use words that people are searching for? Does your site text say “check out our homes for sale” when people are searching for [real estate in Boston]?

Next, consider this page name:


It doesn’t take a special tool to know that the URL isn’t user-friendly. Compare it to:


But you can have too much of a good thing. It also doesn’t take a special tool to know that this page name isn’t user-friendly:


And speaking of putting a dash in URLs, hyphens are often better than underscores [Ed. Note: bolded by Matt 🙂 ]. african-elephants.html is seen as two words: “african” and “elephants”. african_elephants.html is seen as one word: “african_elephants”. It’s doubtful many people will be searching for that.
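You can see the same word-boundary behavior in any regex engine: \w matches letters, digits, and the underscore, but not the hyphen. A quick Python illustration:

```python
import re

# A naive word tokenizer splits the hyphenated name into separate
# words but keeps the underscored name glued together as one token.
print(re.findall(r"\w+", "african-elephants.html"))
# → ['african', 'elephants', 'html']
print(re.findall(r"\w+", "african_elephants.html"))
# → ['african_elephants', 'html']
```

The hyphenated filename yields the two words people actually search for; the underscored one yields a token nobody types into a search box.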

Other tips:

* Don’t break your content up too much. Users don’t like to continuously click to get the content they’re looking for. Search engines can better know what your site is about when a lot of key content is on one page. When the content is broken up too much, it’s not as easily searchable. FAQ pages don’t have to be 15 different tiny pages; often one page with 15 questions on it is better for users and search engines.
* Make sure your content is unique. Give a searcher a reason to want to click your site in the results.
* Make sure each page has a descriptive <title> tag and headings. The title of a page isn’t all that useful if every page has the same one.
* Minimize the number of redirects and URL parameters [Ed. Note: I’d keep it to 1-2 parameters if possible]. And don’t use “&id=” in the URL for anything other than a session ID. Since it generally is a session ID, we treat it as such and usually don’t include those URLs in the index.

Google isn’t secretive about these tips. And the panelists in the session reviewed the sites using tools readily available. They looked at the sites, read through them, and clicked around. No magic needed.

Thanks, Vanessa!

Catching up…

I’m sitting in Boston’s Logan Airport after WMW, waiting for my flight back. Normally I don’t post a bunch of links, but I want to clear out my browser tabs.

– Looks like the Google Calendar API is already out? I love that they got it out so soon; it seems like the whole calendar launch was really tight, and I can’t wait to see what people do with an API.
– Chris Pirillo posts interviews from SES New York.
– I’m really enjoying the resources (including a newsletter) that Google is doing for librarians. There’s a new poster that lists a bunch of search commands.
– Steve Boymel showed me a really smooth iGoogle-like personalized home page with smooth drag and drop, easy RSS adding (including the number of posts to show from a favorite blog), and color customization. Pretty nice.

Doh, they’re boarding. Hope I have no typos.