Archive for March, 2006

Googlebot: Keep out!

Okay, that last post was pretty earnest, so I feel the need to post something really technical now. At SES New York, someone asked “Why don’t you provide a parameter, like ‘?googlebot=nocrawl’ to say ‘Googlebot, don’t index this page’?”

That was a pretty good question. The short answer would be that on pages you don’t want indexed by spiders, you can add this meta tag to the page:

<META NAME="ROBOTS" CONTENT="NOINDEX">

You can read more about the noindex and nofollow meta tags on our webmaster pages.
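If you want to double-check that a page really carries the tag, here is a minimal sketch using only Python's standard library. The class and function names (`RobotsMetaParser`, `has_noindex`) are made up for illustration, and this only inspects the HTML you hand it; it says nothing about how Googlebot itself parses pages.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with NOINDEX."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

For example, `has_noindex('<META NAME="ROBOTS" CONTENT="NOINDEX">')` is true, while a page whose robots meta tag only says `nofollow` returns false.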

But the user specifically wanted a url parameter. I mentioned that because the parameter “id” is often used for session IDs, Googlebot used to avoid urls with “?id=(let’s say a five digit or larger number)” but that I didn’t know if that was still true. I think someone else nearby asked “Isn’t that kind of an ugly hack though?” and I had to fall back on “You asked for something that worked, not something that was pretty.” The questioner persisted, but I was out of other ways to do it, so I said I’d pass the feedback on, namely “someone wants a url parameter that keeps Googlebot from indexing the page.”

That question came up again today, and I wanted to mention one more way to block Googlebot by using wildcards in robots.txt (Google supports wildcards like ‘*’ in robots.txt). Here’s how:

1. Add the parameter to the urls of pages that you don’t want fetched by Googlebot, like ‘http://www.mattcutts.com/blog/some-random-post.html?googlebot=nocrawl’.
2. Add the following to your robots.txt:

User-agent: Googlebot
Disallow: *googlebot=nocrawl

That’s it. We may see links to the pages with the nocrawl parameter, but we won’t crawl them. At most, we would show the url reference (the uncrawled link), but we wouldn’t ever fetch the page.

Obscure note #1: using the ‘googlebot=nocrawl’ technique would not be the preferred method in my mind. Why? Because it might still show ‘googlebot=nocrawl’ urls as uncrawled urls. You might wonder why Google will sometimes return an uncrawled url reference, even if Googlebot was forbidden from crawling that url by a robots.txt file. There’s a pretty good reason for that: back when I started at Google in 2000, several useful websites (eBay, the New York Times, the California DMV) had robots.txt files that forbade any page fetches whatsoever. Now I ask you, what are we supposed to return as a search result when someone does the query [california dmv]? We’d look pretty sad if we didn’t return www.dmv.ca.gov as the first result. But remember: we weren’t allowed to fetch pages from www.dmv.ca.gov at that point. The solution was to show the uncrawled link when we had a high level of confidence that it was the correct link. Sometimes we could even pull a description from the Open Directory Project, so that we could give a lot of info to users even without fetching the page. I’ve fielded questions about Nissan, Metallica, and the Library of Congress where someone believed that Google had crawled a page when in fact it hadn’t; a robots.txt forbade us from crawling, but Google was able to show enough information that someone assumed the page had been crawled. Happily, most major websites (including all the ones I’ve mentioned so far) let Google into more of their pages these days.

That’s why we might show uncrawled urls in response to a query, even if we can’t fetch a url because of robots.txt. So ‘googlebot=nocrawl’ pages might show up as uncrawled. The two preferred ways to have the pages not even show up in Google would be A) to use the “noindex” meta tag that I mentioned above, or B) to use the url removal tool that Google provides. I’ve seen too many people make a mistake with option B and shoot themselves in the foot, so I would recommend just going with the noindex meta tag if you don’t want a page indexed.

Obscure note #2: You might think that the robots.txt that I gave would block a url only if it ends in ‘googlebot=nocrawl’, but in fact Google would match that parameter anywhere in the url. If (for some weird reason) you only wanted to block a url from crawling if ‘googlebot=nocrawl’ was the last thing on the line, you could use the ‘$’ character to signify the end of the line, like this:


User-agent: Googlebot
Disallow: *googlebot=nocrawl$

Using that robots.txt would block the url
http://www.mattcutts.com/blog/somepost.html?googlebot=nocrawl
but not the url
http://www.mattcutts.com/blog/somepost.html?googlebot=nocrawl&option=value.
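To make the difference between the two rules concrete, here is a rough sketch of the matching in Python. It assumes a simplified model: ‘*’ matches any run of characters, a trailing ‘$’ anchors the end of the url, and the pattern is compared against the path plus query string starting from the beginning. The function names are hypothetical and this is not Googlebot's actual code; the robots.txt checker mentioned in this post is the authoritative test.

```python
import re
from urllib.parse import urlsplit


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a Disallow pattern with '*' and optional trailing '$' to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with "match anything".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_blocked(url: str, disallow_pattern: str) -> bool:
    """Return True if the url's path and query match the Disallow pattern."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # re.match anchors at the start, mirroring prefix-style matching.
    return pattern_to_regex(disallow_pattern).match(target) is not None


plain = "http://www.mattcutts.com/blog/somepost.html?googlebot=nocrawl"
extra = "http://www.mattcutts.com/blog/somepost.html?googlebot=nocrawl&option=value"

print(is_blocked(plain, "*googlebot=nocrawl"))   # True
print(is_blocked(extra, "*googlebot=nocrawl"))   # True
print(is_blocked(plain, "*googlebot=nocrawl$"))  # True
print(is_blocked(extra, "*googlebot=nocrawl$"))  # False
```

Under this model the un-anchored rule blocks both urls, while the ‘$’ version blocks only the one that ends in ‘googlebot=nocrawl’.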

If you hung on all the way to the end of this post, good for you! You know stuff that most people don’t know about Google now. If you want to try other experiments with robots.txt without any risk at all, use our robots.txt checker built into Sitemaps. It uses the same logic that the real Googlebot uses; that’s how I tested the stuff above.


Google++

If you’re not a Googler, please ignore this post.

Okay, it’s just us Googlers now, right? I’m sure you’ve seen Danny Sullivan’s post about 25 things he loves about Google and 25 things he hates about Google. If your service got a shout-out on the love list, congratulations. There’s a ton of stuff that Google is doing really well, and lots of groups listen to what our users want and work hard to make that happen.

But: the list that everyone should mull over is what Danny hates. You’ve been handed detailed bug reports (for free!) from one of the foremost experts in search. Bug reports? Yup, that’s how I’d treat them. Sure, I’d disagree with a few (#3, #4, #9, and #16 are the ones that I’d respectfully disagree with the most). But the criticisms that Danny gives should be addressed, even if the issue is mostly perception.

The blessing (even if sometimes it feels like it’s not) of working at Google is that everyone has an opinion on Google and what they want Google to be doing. Of course we’re working hard on core search quality. Of course we’re working on great products. But I implore you, gentle Googler, to listen to users, webmasters, advertisers, and publishers whenever you can. Find out what issues are hot with our user support team. Browse the feedback from the “Dissatisfied” link. Read the feedback we’ve gotten on how to improve our products, search quality, communications, webmaster-related ideas, webspam, and miscellaneous feedback. We’ve been collecting feedback for years, and if you talk to users on a regular basis, it keeps you grounded and working on the right things. We don’t always have the time or cycles to fix every issue that a user asks about. But we should always strive to. And when someone outside Google tells us something that they wish were different, we should look for scalable, robust ways to tackle it.


SEO Advice: clean house before press releases

(Just a quickie post)

A quick tip related to checking your own site: if you’re going to send out a press release about your SEO company, make sure your site is clean before you do the press release.

Site in Internet Explorer:
[Image: website]

Site after pressing the <ctrl>-A key:
[Image: website with hidden text]

We often read press releases too. If you’re going to attract attention to yourself, make sure your site is clean first.


SEO Advice: check your own site

Remember a while ago when I said that you should check your website for spam before doing a reinclusion request? In general, any time you think Google may be dropping your site, it’s a good idea to check for spam on your own site. For example, a site called “The People’s Cube” recently posted an open letter to Google at http://www.thepeoplescube.com/red/viewtopic.php?t=637 because thepeoplescube.com was no longer in Google’s index. Titled “Google Purges The People’s Cube Worldwide”, it begins “Dear comrades at Google” and contains rhetoric like this:

We suspect it is also a deliberate removal - much in the spirit of 1984-style historical revisionism - removal of a “people’s enemy” from life and history. …. We can only think of three reasons for this:

1. Google is retaliating against sites that ridiculed its Google China project.
2. Google has begun to implement its Google China policies in the rest of the free world.
3. A left-leaning Google employee who’s got access to the database was suffering a nervous breakdown over the mockery of Marxism on our site, and so he or she dastardly removed/blocked The People’s Cube, hoping to “improve” the public discourse by silencing the competition.

You tell me which one it is.

Frankly, I could care less if your site is about neo-anti-Marxist disestablishmentarianism. I don’t care if your site is for cat people or dog people, or if you favor big-endian bytes or little-endian bytes. I don’t care whether your site is conservative, liberal, or Pastafarian. I care about spam: hidden text, hidden links, scamming links, cloaking, sneaky redirects, etc. So I’m going to go with reason #4: you had spam (specifically hidden text) on your pages. When Googlebot visited http://www.thepeoplescube.com/Truth.php on Sun, 05 Mar 2006 12:17:12 GMT, the page looked fine to users, but had hidden text. Here’s what it looked like with Cascading Style Sheets (CSS) on:

[Image: With CSS on]

But if you turn off Cascading Style Sheets (CSS), you see stuff like this at the bottom of the page:

[Image: With CSS off]

My answer to The People’s Cube is to make sure that all the hidden text/links are gone, and then do a reinclusion request. And don’t think I didn’t notice the cross-linking with sites such as gqw.us and sitexpress.net and che-mart.com that still have text/links hidden via CSS. Those sites should be cleaned up as well. I don’t know why thepeoplescube.com was hiding links to sites like buyonlywithus.com, which still has hidden text via CSS that says

Miami Beach Real Estate Florida Miami Real Estate Florida Investment Property Miami Beach Florida Condominiums Miami Florida Apartment Buildings For Sale Miami Florida Miami Beach Florida Apartment Buildings Miami Florida Apartment Buildings Miami Beach Florida MLS Search Miami Florida MLS Search Miami Beach Florida Lofts for Sale Miami Beach Waterfront Homes Miami Condos for Sale Miami Beach Florida Investment Property Miami houses condominiums preconstruction in Miami for vacation, retirement or investment, Miami Beach Real Estate Florida Miami Real Estate Florida Miami Beach Real Estate Florida Miami Real Estate Florida Miami Beach Real Estate Florida Miami Real Estate Florida Investment Property Miami Beach Florida Condominiums Miami Florida Apartment Buildings For Sale Miami Florida Miami Beach Florida Apartment Buildings Miami Florida Apartment Buildings Miami Beach Florida MLS Search Miami Florida MLS Search Miami Beach Florida Lofts for Sale Miami Beach Waterfront Homes Miami Condos for Sale Miami Beach Florida Investment Property Miami houses condominiums preconstruction in Miami for vacation, retirement or investment, Miami Beach Real Estate Florida Miami Real Estate Florida Miami Beach Real Estate Florida Miami Real Estate Florida

but that’s the sort of reason why you’d be removed from Google, not because of what thepeoplescube.com was saying.
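If you’d like a quick first pass over your own pages, here is a rough sketch in Python that pulls out text sitting inside elements hidden with an inline style. It is only a heuristic under loud assumptions: it looks at inline `style` attributes alone (not external stylesheets, and not text colored to match the background), it assumes reasonably well-nested markup, and the names (`HiddenTextFinder`, `find_hidden_text`) are invented for this example. It is in no way a description of how Google detects hidden text; turning CSS off in your browser, as above, remains the simpler check.

```python
import re
from html.parser import HTMLParser

# Inline declarations that hide an element from users.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)

# Elements that never have a closing tag, so they must not affect nesting.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HiddenTextFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []        # one flag per open element: inside hidden content?
        self.hidden_text = []  # text fragments found inside hidden elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        parent_hidden = bool(self.stack) and self.stack[-1]
        self.stack.append(parent_hidden or bool(HIDDEN_STYLE.search(style)))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and self.stack[-1] and data.strip():
            self.hidden_text.append(data.strip())


def find_hidden_text(html: str) -> list:
    """Return the text fragments that sit inside inline-hidden elements."""
    finder = HiddenTextFinder()
    finder.feed(html)
    return finder.hidden_text
```

Anything this returns deserves a look by a human; plenty of legitimate pages hide menus or dialogs this way, so a hit is a prompt to check, not proof of spam.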

Update: Anyone got an account over at little green footballs so I can go and debunk over there? Looks like registration is closed at lgf. Sigh.


Review: ShuttlePRO Multimedia Controller

Short review: It just works. I love it. Highly recommended if you’re looking for a flexible USB input device. It looks like this:

[Image: Shuttle Pro 2]

Longer review: I was in an Apple store with a friend this weekend who was buying a video iPod, and I saw the ShuttlePRO2 from Contour Design and bought it on an impulse. It’s pretty neat. Essentially, it’s a USB device with 15 input buttons and a jog/shuttle wheel for $99. I’m a sucker for interesting input devices; blame it on going to grad school at UNC-Chapel Hill, where one-of-a-kind input/output devices are commonplace.

The Shuttle Pro works with Mac or PC; I’m taking it in to work tomorrow to try with Linux. For Windows, it comes with a simple driver that lets you program each button to represent almost any keystroke. It also comes with a set of pre-printed labels and a set of blank labels; you can pry off the plastic top of each key with your fingernail to put a label under each button. Contour Design even provides template files for making professional labels. And did I mention that the controller just works? So I set the first four buttons to be h, j, k, l (the keys that you use to move left, down, up, right in vi).

Now I needed to do something with my controller. So it was time to install Python 2.4 and then Pygame, which provides Python bindings to the SDL (Simple DirectMedia Layer) library. Pygame has some really nice tutorials too. Starting from the program in this tutorial, in less than an hour and in less than 40 lines of code, I’m controlling a bouncing ball with my new USB device:

[Image: Bouncing ball]

And that was the beginning of a really fun, geeky weekend. :)
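For anyone who wants to try something similar, here is a minimal sketch of a keyboard-steered bouncing ball. It is not the program from that weekend: it assumes a current Python 3 and pygame install rather than Python 2.4, draws a circle instead of loading an image, and the names (`KEY_DELTAS`, `steer`, `advance`, `main`) are made up for illustration. Since the controller buttons were programmed to send h, j, k, l, the sketch simply listens for those keystrokes.

```python
# Sketch only: assumes Python 3 with pygame installed.

# vi-style keys: h = left, j = down, k = up, l = right (screen y grows downward).
KEY_DELTAS = {"h": (-1, 0), "j": (0, 1), "k": (0, -1), "l": (1, 0)}


def steer(velocity, key_name):
    """Nudge the velocity by one unit in the direction of the pressed key."""
    dx, dy = KEY_DELTAS.get(key_name, (0, 0))
    return (velocity[0] + dx, velocity[1] + dy)


def advance(position, velocity, size, radius):
    """Move the ball one step, reversing direction at the window edges."""
    x, y = position[0] + velocity[0], position[1] + velocity[1]
    vx, vy = velocity
    if x - radius < 0 or x + radius > size[0]:
        vx = -vx
    if y - radius < 0 or y + radius > size[1]:
        vy = -vy
    # Keep the ball fully inside the window after a bounce.
    x = min(max(x, radius), size[0] - radius)
    y = min(max(y, radius), size[1] - radius)
    return (x, y), (vx, vy)


def main():
    import pygame  # imported here so the helpers above work without pygame

    pygame.init()
    size, radius = (320, 240), 10
    screen = pygame.display.set_mode(size)
    clock = pygame.time.Clock()
    position, velocity = (160, 120), (2, 2)
    running = True
    while running:
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                running = False
            elif event.type == pygame.KEYDOWN:
                velocity = steer(velocity, pygame.key.name(event.key))
        position, velocity = advance(position, velocity, size, radius)
        screen.fill((0, 0, 0))
        pygame.draw.circle(screen, (255, 255, 255), position, radius)
        pygame.display.flip()
        clock.tick(60)  # cap the loop at roughly 60 frames per second
    pygame.quit()
```

Call `main()` to open the window; each press of h, j, k, or l (from the keyboard or from a controller button mapped to that key) nudges the ball's velocity.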

