Bot Obedience: Herding Googlebot

I noticed a useful session at the upcoming Search Engine Strategies conference in San Jose. In exactly a month there will be a Bot Obedience class. People sometimes ask me about how to “sculpt” where Googlebot visits, and my only other post about this was pretty technical, so I’ll take a stab at a shorter, clearer post.

At a site or directory level, I recommend an .htaccess file to add password protection to part of a domain. I wrote a quick example of setting up an .htaccess file about this time last year. I’m not aware of any bot (including Googlebot) that guesses passwords, so this is quite effective at keeping content out of search engines.

At a site or directory level, I also recommend a robots.txt file. Google provides a simple robots.txt checking tool to test out files before putting them live.
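
As a rough illustration of testing rules before they go live, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt contents, the paths, and the example.com URLs are made up for this example. This parser is not Google's checking tool, so it may not match Googlebot's behavior in every case (wildcards, for example).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; the directory names are invented for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/index.html",
        "http://example.com/private/report.html",
        "http://example.com/drafts/post.html",
    ):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, "Googlebot", url) else "blocked"
        print(f"{url}: {verdict}")
```

Checking a draft file this way before uploading it is cheap, but the final word on how Googlebot reads a file is Google's own tool.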

At a page level, use meta tags at the top of your HTML page. The noindex meta tag will keep a page from showing up in Google’s index at all. This tag is great on any page that’s confidential. The nofollow meta tag will prevent Googlebot from following any outgoing links from a page. This page shows the proper syntax.
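
To double-check which directives a page actually carries, a small script can read the robots meta tag back out of the HTML. This is only a sketch using Python's standard-library `html.parser`; the sample page and the helper names (`RobotsMetaParser`, `robots_directives`) are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> or <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() not in ("robots", "googlebot"):
            return
        # The content attribute is a comma-separated list such as "noindex, nofollow".
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    # Hypothetical confidential page used only for this example.
    sample = """<html><head>
    <meta name="robots" content="noindex, nofollow">
    <title>Internal notes</title>
    </head><body>Confidential.</body></html>"""
    print(sorted(robots_directives(sample)))
```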

At a link level, you can add a nofollow attribute (rel="nofollow") to individual links to prevent Googlebot from crawling them (you could also make the link redirect through a page that is forbidden by robots.txt). Bear in mind that if other pages link to a url, Googlebot may find the url through those other paths. If you can, I’d recommend using .htaccess or robots.txt (at a directory level) or meta tags (at a page level) to be safe. I’ve seen people try to sculpt Googlebot visits at the link level, and they always seem to forget and miss a few links.
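
Because link-level sculpting tends to fail on the links that get forgotten, a quick audit can help. The sketch below lists links into a given path that do not carry nofollow; the sample HTML, the `/private/` prefix, and the helper names are all assumptions made up for illustration.

```python
from html.parser import HTMLParser


class LinkAuditor(HTMLParser):
    """Sorts the links on a page into followed and nofollowed lists."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        # rel can hold several space-separated tokens, e.g. rel="external nofollow".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


def links_missing_nofollow(html: str, target_prefix: str) -> list:
    """Return hrefs starting with target_prefix that lack rel="nofollow"."""
    auditor = LinkAuditor()
    auditor.feed(html)
    return [href for href in auditor.followed if href.startswith(target_prefix)]


if __name__ == "__main__":
    # Hypothetical page: one link into /private/ was left without nofollow.
    sample = """
    <a href="/private/a.html" rel="nofollow">A</a>
    <a href="/private/b.html">B</a>
    <a href="/public/c.html">C</a>
    """
    print(links_missing_nofollow(sample, "/private/"))
```

Even a clean audit of your own pages says nothing about links from other sites, which is why the directory-level and page-level methods are the safer choice.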

If the content has already been crawled, you can use our url removal tool. This should be your last resort; it’s much easier to prevent us from crawling than to remove content afterwards (plus the content will be removed for six months). This help page discusses how to remove other types of content from Google.

Update: Vanessa Fox pointed out this Googlebot help page which covers a ton of other Googlebot questions.

75 Responses to Bot Obedience: Herding Googlebot

  1. should be a very interesting session

  2. William Donelson

    “simple robots.txt checking tool” link goes deeper and deeper and you end up HAVING TO HAVE GOOGLE SITEMAPS — NOT SIMPLE.

  3. Can we use the removal tool to remove just one page, or does it apply to the whole site?

  4. Sooo…… here is a question then…
    What about a page that you told Googlebot to not index (via robots.txt) but Google has found the page and lists it, but without a description/title, etc. I know that you have commented on this before (you mentioned that they could get more traffic if they just opened up that page to Googlebot…).

    The page in question is a “tags” page where we aggregate a bunch of articles (columns, blogs, whatever you want to call it) from our site for a specific tag (category, keyword, whatever you want to call it). Up until now, we have blocked the bots from reading to eliminate “duplicate content” although each page would be unique to an extent because it would bring together similar articles whereas now they are just sorted into archival date.

    Would it be better to 1) keep blocking Googlebot and have “empty” results show up in the SERPs 2) take some time and optimize those categories and open the page up to Googlebot 3) Add a “noindex” to the top of the page in question?

  5. Thanx for those tips….. :)

    Apparently JavaScript links also seem to neutralize spidering: not redirects, but JavaScript links. Encrypted JavaScript links are another alternative.

    But another piece of advice – the robots.txt file can be accessed directly by nosy, savvy webmasters. God forbid you are ever in a legal dispute.

    And even if the URL removal request went through – LOL, you’d be surprised what ends up in the Web Archive nowadays. :o

    Another option is just changing the contents of the file that is already on Google – you may want to expand your site, so the URL is already on Google – USE IT!

    …..and use a completely new URL for the private content, and take the precautions mentioned in this topic

  6. Key_Master

    Hi Matt,

    I would like to see Google implement a more advanced robots.txt protocol that utilizes server headers to display bot directives.

    Examples:

    Header append Robots “noindex”

    Header append Googlebot “noarchive”

  7. Key_Master

    It appears as though the blogware modified the examples I had given. A better explanation can be found here:

    http://www.webmasterworld.com/forum30/34757-3-20.htm

  8. I have some bad (cybersquatter) pages indexed by Google that need to be removed w/ the remove tool, and they all point to index.php?whatever, and unfortunately I’m also using the name “index.php” because of WordPress and it’s too late to rename index.php to something else. So I have been pulling my hair out to get these cybersquatted index.php pages to 404 using .htaccess RewriteCond or whatever so that the Google Remove Tool will (hopefully) allow me to get this p0rn out of the index. Anyway, Google *should* make this easier by just allowing me to remove links from the index no questions asked. I shouldn’t have to write a bunch of code to *prove* that pages do not exist with 404 errors that are sometimes very difficult to generate.

  9. chris

    Since I can never get a good answer when I ask this in forums: why does Google have my pages indexed properly on my blog only about 1/3 of the time? Currently I have just over 1000 pages on my blog, and 2 days ago Google had 800 of them indexed properly; today there are 750, but 400+ are supplemental. This is driving me nuts! It has happened 6 times in 6 months. The pages are still there, and I do not have this problem with Yahoo and MSN. What am I doing wrong? Or is it Google?
    I know this is off topic but I be pissed you feel me? :-)

  10. chris

    shoe, go play with your kid!

  11. Thanks Matt. I’ve never removed a URL, but one of our forums got spammed and we took it down. It all got too hard. Can I use the tool to remove a whole directory of pages?

  12. Please mention to your readers that the site removal webpage removes pages/websites for a minimum of 6 months.

    Also,

    At a link level, you can add a nofollow tag on the granularity of individual links to prevent Googlebot from crawling individual links

    By mentioning “crawl” some people might be inclined to think that the particular page that the link links to, will not get indexed. IT WILL.

  13. Hi Matt,
    Do you have any recommendation on how to bring supplemental pages back to the normal level (one that allows them to actually rank for certain keywords and be included in the index)? What actually happens when I have 2 pages with 100% the same content, where just the headers are different (for example, on one page the header has 3 links: a.com, b.com, c.com, and on the other one there are 3 different links, say: d.com, e.com, f.com)? Will Google flag those pages as duplicate content and mark them as supplemental?

    Thank you…

  14. What happens after the “minimum” 6 months?

  15. New Jersey Guy, you can definitely select smaller parts of a site; I believe you can go down to the url level. Be really careful though, because the pages disappear for six months. Don’t try to do anything like “I’ll just remove the non-www version of my site” because if we know that www is the same as non-www, it can remove the whole site.

    William Donelson, I like the Sitemaps robots.txt tool because it’s pretty much identical to how Googlebot operates, and knows about stuff like wildcards, but there are lots of robots.txt checkers. Brett Tabke has one on searchengineworld.com, and http://tool.motoricerca.info/robots-checker.phtml is another. I’d just search for [robots.txt checker] on Google.

    AHFX, I’d vote for #2 or #3, and probably #2. It’s true that with robots.txt, Google can still show the url reference (without crawling the url). That dates back to when Ebay and the NY Times forbade all bots, but we still wanted to return ebay.com and nytimes.com. To not show up at all, I’d allow us to crawl it but put a noindex meta tag in place.

    S.E.W., in general I’d be cautious of relying on JavaScript to keep bots from following a link. Google for one has gotten better at crawling even when links are in JavaScript.

    Cristian, done about the six months. Those pages might show up in Yahoo!, but I don’t believe that they will carry any weight for Google. I think Graywolf posted an example of an (MSDN?) page where it looked like we’d followed a nofollow link, but when we dug into it, it looked like the link had been a “follow” link for a long time, and MSDN had just added nofollow tag afterwards.

    Brandi Belle, we might detect those pages as dups or near dups and choose to show only one page, but that probably wouldn’t cause pages to get bumped to supplemental. In general, the best way I know of to move sites from more supplemental to normal is to get high-quality links (don’t bother to get low-quality links just for links’ sake).

    Shoemoney, I’m hoping Vanessa Fox or another Googler can join the Bot Obedience panel. In case I’m not there, I wanted to get down my thoughts. :)

  16. Yes, prevention is much better than cure. Here is what one University had to go through to get content deleted from Google’s cache:

    http://digg.com/security/Getting_things_deleted_from_Google_s_cache

  17. Glad you are back Matt.

    I would be interested in reading a blog post from you on the “site:” command issue; why others like myself feel that it is not reporting the correct results.

    Any updates on this issue?

    Or is it the same answer: that we don’t have enough quality backlinks to get our sites crawled more frequently?

    Thanks.

  18. General Public

    Googlebot indexing is getting irrelevant; in most competitive categories the SERP pages are produced by a “hand rank” (HR) algorithm rather than the page rank (PR) algorithm these days.

    The main criterion hand rank people use to judge a page is how good a page looks, and they have their own misconceptions like “web design = graphic design” or “web design = multimedia design”, so they will rank paper brochure design and CD presentation companies higher than web design companies in web design SERPs, and so on…..

    This is exactly the situation of Altavista in 1999: their search algorithms failed, they started hand adjusting SERPs, and the rest is history.

    The effect of Google AdSense on the internet has been “millions” or is it “billions” of irrelevant made-for-AdSense pages appearing on the net….. now Google seems to be going down fighting the monster they themselves created…..

    From my years of watching the rise and fall of search engines one thing is clear: no single entity can index or categorize the entire internet, hence Google or any new entity dominating the web is simply a myth!!!

  19. Dave, I’m still coming up to speed on site:. I’ve seen at least a couple situations where the site: estimates aren’t accurate lately. I believe the datacenter at 72.14.207.104 has more accurate site: estimates for one of those two cases.

  20. rg

    It would be nice to be able to “herd” googlebot once a page has gotten into the supplemental index. How difficult would it be to modify the normal googlebot so that if it found a page with a 301 on it, to check the supplemental index, and if the page was in there, remove it. Waiting 6+ months for supplemental googlebot to roll around is just not acceptable.

  21. rg, the cycle time on supplemental googlebot has been going down recently.

  22. About the sitemaps / robots.txt checker notes:

    keep in mind that it is NOT necessary to set up a sitemap on a site in order to access other “site maps” functionality, such as the robots checker. The only requirements are a google account and the ability to put an empty text file on the site’s server or in page meta tags.

    The name “sitemaps” is a bit of a misnomer; Google Webmaster Dashboard would more accurately describe the current tool which has nicely evolved from the original sitemaps function.

  23. MB

    I have added my site to Google using a sitemap, and twice it has refused to crawl, saying the homepage was unreachable. Before I put it in the sitemap it was indexed, but after the “home page unreachable” message it has disappeared from the Google search engine.

    Sitemaps has removed my site, which was properly indexed and was getting good ranks, because of an unknown error. I will never ever use Sitemaps again. Instead of this, Google’s free add url works well.

  24. Hey Matt,
    Thanks for the answer. I got another quick question for you, and if you can please answer.
    Namely, one of my sites (a WordPress blog) was ranked very well for certain keywords; it was in the top 10, and I even managed to get 2nd position. It was all OK before the 27th of June, when my site dropped from the 1st page and went who knows where (I can’t even find it in the top 100 now). The most interesting thing is that I didn’t touch the site at all; I was just adding fresh keyword-rich posts (content) every day as I was doing before. No hidden text/links, cloaking or other black hat things; everything was pure white hat SEO.
    My pages are indexed well and site: query shows them all (they are not supplementals). I am wondering what could cause my ranking drop so bad?

    Thank you,
    Brandi

  25. I think Graywolf posted an example of an (MSDN?) page where it looked like we’d followed a nofollow link, but when we dug into it, it looked like the link had been a “follow” link for a long time, and MSDN had just added nofollow tag afterwards.

    A lot of my nofollow links (internal or external) get spidered (Gbot visits them) and get indexed (without having any other IBLs from anywhere else). I can say this for the past two months, this being the timespan in which I ran tests. I do have other pages that, indeed, do not get indexed. It’s pretty unstable.

    I believe the datacenter at 72.14.207.104 has more accurate site: estimates for one of those two cases.

    I watch this DC (and another one that presents the exact same results: 72.14.207.99) for 10+ of my websites (with the indexed pages problem). It’s not accurate at all Matt. Whilst it does provide a larger number of indexed pages, 90% of them are Old/Outdated/Not existing anymore pages :

    as retrieved on 20 Aug 2005 21:56:28 GMT.

    I can see this issue (on those 2 DCs above) for all the websites I have problems with (extremely low numbers of indexed pages).

    I don’t want to stress you out (or point to another link, so people can call me a spammer), but I posted an article on my blog about what webmasters can do to try to get their usual number of pages indexed. Do you believe I hit some good points or not? If not, can you please tell us/them what they should REALLY be doing to ensure they get their pages back? (I’m saying this with the idea that Google CAN show all the indexed pages, and NOT that it’s 100% broken. Just partially.)

  26. Tim

    Hi Matt,

    thank you for elaborating on this topic. Perhaps you can tell me, how G-bot treats the following:

    I have a sitemap on my site with some thousand links to make it easy for robots (not just google) to discover all my files. As I don’t want this sitemap to show up in the SERPS (I guess this sitemap wouldn’t be of much interest to users or it would even be considered spam) I have put the following in the meta:

    I gather from http://www.google.com/support/webmasters/bin/answer.py?answer=35303 that this is not the standard syntax, but do you think / know whether this makes sense?

    Thank you very much in advance and sorry for my broken English.

  27. Tim

    hm, somehow this part got cut out:

    (putting asterisks in so it doesn’t get sorted out again..)

  28. Tim

    …somehow WP filters out that line… I mean noindex, follow in the Meta of the page

  29. Vanessa Fox

    William Donelson, Sean Carlos is right. You don’t have to create a Sitemap to use the Sitemaps robots.txt analysis tool. In fact, you don’t even have to verify site ownership. All you need to do is log in, add your site URL, and go to the robots.txt analysis link.

  30. Accidentally posted this on your robots.txt thread instead of here…
    Purge the other if you want.

    Matt,

    Robots.txt should always be dynamic based on the user agent and only show this to the rest of the world:

    User-Agent: *
    Disallow: /

    The reason is that once you publicly show all the bad bots what user agents you allow in robots.txt they simply cloak for those bot names and slide right thru your .htaccess filtering for bad bots like it didn’t exist.

    Show them nothing and they have to play hit ‘n miss with user agent names trying to bypass your filters and expose themselves.

    BTW, why doesn’t Google post the exact IP ranges used by Googlebot so we can effectively lock out spoofers without risking locking out Google?

    Google also crawls thru cloaked directories that some proxy servers show Google, and pages get hijacked when it crawls thru these redirected links, which happened to me several times before I blocked that.

    To stop spoofers and being hijacked via proxies, I now only allow Googlebot based on known IP ranges for Googlebot OR if NSLOOKUP or WHOIS says the IP is owned by Google. Anything outside of that spectrum claiming to be Googlebot gets slapped in the face, but the official IP list of all Google crawlers would make life MUCH easier.

    How about it Matt, do you have a published list I’ve missed?

    Ask Jeeves was nice enough to give me their definitive IP list…
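
The NSLOOKUP-style check described in comment 30 can be sketched as a reverse DNS lookup followed by a forward lookup that must point back at the same IP. This is only an illustration: the allowed hostname suffixes, the function names, and the fake addresses in the demo are assumptions and are not given anywhere in this thread. The resolvers are passed in as parameters so the logic can be tried without touching the network.

```python
import socket

# Assumption for this sketch: hostnames of genuine crawlers end in one of these suffixes.
ALLOWED_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse_dns(ip):
    """Default reverse lookup: IP address -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward_dns(host):
    """Default forward lookup: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(ip, reverse_lookup=_reverse_dns, forward_lookup=_forward_dns,
                        allowed_suffixes=ALLOWED_SUFFIXES):
    """Return True only if ip reverse-resolves to an allowed host that resolves back to ip."""
    try:
        host = reverse_lookup(ip)
    except OSError:  # socket.herror and socket.gaierror are subclasses of OSError
        return False
    if not host.lower().rstrip(".").endswith(allowed_suffixes):
        return False
    try:
        addresses = forward_lookup(host)
    except OSError:
        return False
    # The forward lookup must include the original IP, otherwise the PTR record is spoofed.
    return ip in addresses


if __name__ == "__main__":
    # Fake DNS tables using documentation-only addresses, so the demo runs offline.
    ptr = {"192.0.2.10": "crawl-192-0-2-10.googlebot.com",
           "192.0.2.20": "fake.googlebot.com"}
    a = {"crawl-192-0-2-10.googlebot.com": ["192.0.2.10"],
         "fake.googlebot.com": ["198.51.100.7"]}

    def fake_reverse(ip):
        if ip not in ptr:
            raise OSError("no PTR record")
        return ptr[ip]

    def fake_forward(host):
        if host not in a:
            raise OSError("no A record")
        return a[host]

    for ip in ("192.0.2.10", "192.0.2.20", "192.0.2.99"):
        print(ip, is_verified_crawler(ip, fake_reverse, fake_forward))
```

A check like this avoids depending on a published IP list, at the cost of two DNS lookups per unknown visitor, so in practice the results would need caching.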

  31. spike

    ** BTW, why doesn’t Google post the exact IP ranges used by Googlebot so we can effectively lock out spoofers without risking locking out Google? **

    There ya go: http://ws.arin.net/whois/?queryinput=google

  32. Spike, that’s not exactly what I’m looking for, as that’s all-inclusive of everything Google does, including the data centers, web accelerator, etc., and I’m only looking to validate the crawler specifically.

  33. Matt, can you address how to remove a single page in the Google cache that is interlinked with itself using url parameters, caused by a programming mistake at our end? The URL removal tool will not accept an https:// link. There are >1,800 of these pages and Googlebot visits 30 a day. The index has some as old as Feb 2005. I tried to 301 the page to a single page but cannot be sure it is working after 4 months. This is driving me nuts. If I knew the 301 would remove a page with other links to it I could sleep better at night. Thanks.

  34. Sean Carlos, I agree that we’ve sort of outgrown the “Sitemaps” name. What would you call it? Google Webmaster Console? Webmaster Central? Big Box O’ Webmaster Tools?

    Brandi Belle, there was a refresh of the data that we use in our algorithms, and it sounds like that’s what affected you. I believe that there will be another update in 2-3 weeks which could help somewhat. In general, the best advice I’d give is to make sure that the site adds value and has original content.

    IncrediBILL, I don’t think we’ve done so in the past because it changes from time to time, and we didn’t want to give bad/stale information.

    Cristian, the site: numbers at 72.14.207.104 are more accurate, but I’m aware of at least one situation in which the site: numbers are still inaccurate. Plus I think the supplemental results will be more fresh as the cycle time goes down in the future. I looked on your blog but didn’t see the article about the number of pages indexed?

  35. but I’m aware of at least one situation in which the site: numbers are still inaccurate.

    I guess I’m the 2nd one.

    I looked on your blog but didn’t see the article about the number of pages indexed?

    There you go. And thanks for taking the time to look into it.

  36. Brandi Belle, there was a refresh of the data that we use in our algorithms, and it sounds like that’s what affected you. I believe that there will be another update in 2-3 weeks which could help somewhat. In general, the best advice I’d give is to make sure that the site adds value and has original content.

    LMAO!

    I think the pic of her smoking while having a guy do her from behind is pretty original. I haven’t seen that before. :D

    I think Matt’s blog has taken on a whole new dimension…pornstar SEO. :D

    Brandi Belle, I’ve got nothing against what you do, I think it’s cool as long as it’s safe and nobody gets hurt (doesn’t look like anyone does unless you ash the cigarette in the guy’s eye or something)…I just think it’s damn funny that no one’s looked a little deeper into your site yet (apparently including Matt himself…our boy’s got a bit of the freak in him, apparently. :) )

  37. LOL .. I noticed her site now too. Man she’s getting it from behind….

    ROTFLMAO

  38. Dave (Original)

    Isn’t smoking supposed to be done after sex, not during?

  39. I am sure the real Brandi does not use the word “keyword rich”. ;)

  40. Isn’t smoking supposed to be done after sex, not during?

    So our girl’s started a new technique. Let’s not ask questions about what SHOULD be done. Let’s applaud her for her talent, flexibility, and ability to somehow inhale the leftovers from the repaving of the Brooklyn Bridge. The girl’s got skills. :)

    Hey Aaron, when you gonna interview that one? She’d go well with the Raspberry interview. I promise I won’t rip this one to shreds. Really. :)

  41. Google doesn’t always abide by your robots.txt rules. If you exclude your javascript or stylesheet, of course Google will look what you’re trying to hide.

    Excluding them used to be a great trick to have an h1 tag look like normal text or not displaying it (if you wanted to use an image header and needed the h1 for search engines), but now Google reads the disallowed css to see this technique.

    Keep this in mind when you use disallows, nofollows and other techniques. They could make you look suspicious.

  42. Matt,

    This isn’t rocket science.

    Stick your current crawler IP list in XML so we can download it daily as needed, nothing special about that. Besides, I’d rather have a few hours of stale IP addresses than a hijacked page thanks to a proxy site any day of the week.

    It’s easier to get Google to crawl again than it is to get hijacked pages out of your servers.

    Seriously, for everything Google has done good with sitemaps, which is cool, get busy on the flipside and help us thwart spoofers, and hijackings.

    The IP list would be a start.

    FYI, I’m speaking about bot obedience @ SES ;)

  43. IncrediBILL, I get your thinking but I just don’t see IPs scaling, as IPs change all the time, and who’s to say all the traffic coming from an IP is a bot? Perhaps it’s also Matt and his team checking through stuff.

    I suggest that bot authentication is something that should be handled by the http://www.ietf.org/ with a proper RFC process, as it has much more long term implications.

  44. Matt, instead of re-naming it, how about some synergy between it and the other webmaster related products.

    For example.. I should be able to go to google.com/webmasters and not only see my sitemap data, but ..
    My analytics data, tied into my adwords data so I can see what keywords convert more.
    My adsense data so I can see if i’m paying $2 / click on adwords and those visitors are clicking adsense.

    If I could login and see “you got 15 visitors today for keyword widgets (7 were from paid search costing $6.50), you rank 10th on average for that keyword, and 2% of your visitors for that keyword converted into sales” life would be good.

    Currently, I’m having to use my own custom code to do this, but it doesn’t make any sense since Google has all this data. It’s just another sql join…..

  45. The Adam That Doesn’t Belong To Matt – You do know that “brandi” is probably a big smelly dude who has a huge network of prOn blogs right? HA!!!

    I would interview her but I would be tempted to start smoking again. ;)

    Doh!

  46. Hi Matt,

    The DC you refer to (72.14.207.104) that is giving “more accurate” results for site: is actually showing all supplemental for my site. In addition, it is showing pages that have NEVER existed. It’s NOT just the number of pages; the problem IS much deeper than that.

    Please take a look and see for yourself :-)

  47. I suggest that bot authentication is something that should be handled by the http://www.ietf.org/ with a proper RFC process, as it has much more long term implications.

    Well, that would be NICE, but that’s not the reality of what’s on the net today. I’ll bet however they decide to identify bots people will still SPOOF that information, and I’m right back to IPs again.

    Trust me, the crawler’s IPs are what I need. For instance, I may want to block a user and ask them for a password yet let Google crawl thru. However, I don’t want a user to set their user agent to Googlebot just to skip registration, which is possible if you don’t check the source IP. Also, I might not want Googlers to skip registration and walk right in just because they come from the same block of IPs that the crawler might.

    Besides, not all bots and crawlers are run by the big search engines: I block from 100-300 a day, including spybots run by corporations like Cyveillance that claim to be Internet Explorer. I just want to let the good SEs in, keep the junk out, and keep the SEs that I do let in from crawling thru 3rd parties and hijacking my pages, easy enough.

    Doesn’t matter, I’m filtering them anyway but I’m using the shotgun approach at the moment and covering all Google IPs instead of a surgical approach of only which IPs are used by the crawlers, unlike other SEs that were nice enough to give me their list ;)

  48. OOPS, accidentally posted the above twice; hope Matt can zap the duplicate.

  49. I would interview her but I would be tempted to start smoking again.

    No problem. Just give her the cigarette. I’m sure she’d take one for the team. ;)

    (I’m way too easily amused sometimes.)

  50. I agree there’s the, what must be done now, and then the ideal. What I am suggesting is long term it’s a bigger problem that needs thinking through. Perhaps some sort of shared key system whereby you could precisely identify a bot based on it properly authenticating with you would be great.

    Sitemaps already authenticates a website owner (admittedly with a few hiccups to start with…see Dave Nayler ) , what if it went further and flipped that round and included an encrypted key that the spider passed to you that was unique to your URL/URL’s (maybe drop it into the headers?) that only you could certify based on having a relationship with that SE via Sitemaps or similar. If there was some sort of common system all the SE’s had to use backed up by the right site level software you could just choose which spiders you let in and which you don’t because they couldn’t spoof it (hopefully). You make it something that is part of IIS and Apache by default.

    Then you take the issue of malicious bots and make it a legal requirement to be part of the scheme, so that you can just opt out of them.

    Obviously until such time…block those IP’s :-)

  51. Nothing to add other than it’s not that often that I am ROFLMAO reading Matt’s blog…

    “I am sure the real Brandi does not use the word “keyword rich”.”

    HAHAHAHAHAHAHA!!!!!!!!!!!!!!!!!!!!! Great stuff today!

  52. Ryan

    Remember when everybody made fun of Jeremy for being only 3 clicks away from lesbian porn?

    Is this worse?

    Sorry so late on this, didn’t dare try to look at the brandi site while at work.

  53. Tyler

    I had a site which was miserably indexed due to some dupe issues generated by my site. I did a manual remove, hoping to be re-indexed and get things cleaned up. I don’t remember seeing a 6 month wait warning or I definitely would not have gone that route. Any way to get a quicker reinclusion, Matt?

    “If the content has already been crawled, you can use our url removal tool. This should be your last resort; it’s much easier to prevent us from crawling than to remove content afterwards (plus the content will be removed for six months).”

  54. Dave (Original)

    I didn’t dare look at it while at home! At work the worst they can do is sack you, at home the wife……………well you know, I’m sure :)

  55. Our site is an ecommerce site that generates session ids when a user is running around. I need to prevent Googlebot from indexing product pages without session id data. If Googlebot requests a page in our store directly it does not have session id data in the URL. However, if it follows links within the store the session data will always be in the url.

    We created a G sitemap with direct urls for every store page – which are individual product pages, or perhaps a page with several related products. If session id data is in a URL, the URL always has xxxx/cgi-local in it and our robots.txt file disallows /cgi-local. Poor old Googlebot tries to index lots of our pages by following links in our store with /cgi-local in the URL, and thankfully G doesn’t index these pages.

    Our site has been in Google for a lot of years and G onetime indexed a few hundred pages with the session id data. G’s index also has many of our store pages without session id data. 621 in all. I think the urls with the session data will eventually disappear. They are not in G’s index now.

    I looked at our sitemaps data recently and it listed several thousand Googlebot page accesses for the cgi-local directory. I hoped the sitemap with direct links to every site page would prevent that. The pages aren’t being indexed, but I’d like to avoid Googlebot crawling around for nothing. Am I just nuts worrying about it, or is there anything else we can do to herd Googlebot? My guess is Googlebot is going to follow any link it wants, just like your cat is gonna unplug anything it wants also. By the way, I get rid of old 3 Gig Maxtor, etc. with a sledge hammer for security purposes.

  56. I have a question. Suppose I include a noindex tag on a webpage; which of the following cases will be true?

    Case 1: Google won’t crawl the webpage at all.
    Case 2: Googlebot will crawl the webpage but won’t show it in its cache or results.

    you have nice information in this blog.
    Thanks,
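
Because the noindex tag lives in the page's own HTML, a crawler has to fetch the page before it can see the tag, which matches Case 2. The sketch below shows how a crawler-side check for the tag might look, using Python's standard `html.parser`. The class and function names are hypothetical, and this is an illustration only, not a description of how Googlebot is implemented.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> style tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser already lowercases attribute names; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attr.get("content", "").split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(sorted(robots_directives(page)))
```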

  57. One of the more entertaining posts I have seen recently.

    On a more serious note. I think getting bots to respond to parameters defined either by Meta information or via robots.txt has now long been a moot point, whether it be Google or any of the other major players.

    Matt, might it be worth using your extensive expertise and doing an authoritative post on the subject, particularly from a Big G perspective?

  58. IncrediBILL, maybe it’s better to use Agent tag. I’m not sure you should stick to IP’s…

  59. I use BOTH the user agent tag and the IPs to stop spoofing and hijacking from anything outside the valid Google IP range. The problem is, some other things seem to slip through the cracks on their IP range, like the web accelerator and stuff, so I’m going to try to harden my filters a bit more, working in the dark, and hope I don’t zap actual visitors or a Google spider.
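
The combined check described in this comment (user agent and IP range together) could be sketched roughly as follows. The network list is an assumption: it uses a documentation-only range (192.0.2.0/24) as a stand-in and is not a real Google range, and the function names are hypothetical.

```python
import ipaddress

# Hypothetical allowlist. 192.0.2.0/24 is a documentation-only range,
# used here as a placeholder; it is NOT a real Google range.
TRUSTED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def claims_googlebot(user_agent):
    """True if the user agent string claims to be Googlebot."""
    return "googlebot" in user_agent.lower()


def ip_is_trusted(remote_ip):
    """True if remote_ip parses and falls inside a trusted network."""
    try:
        addr = ipaddress.ip_address(remote_ip)
    except ValueError:
        return False
    return any(addr in net for net in TRUSTED_NETWORKS)


def classify(user_agent, remote_ip):
    """Label a request as 'visitor', 'googlebot', or 'spoofed'."""
    if not claims_googlebot(user_agent):
        return "visitor"
    return "googlebot" if ip_is_trusted(remote_ip) else "spoofed"


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(classify(ua, "192.0.2.10"))
    print(classify(ua, "203.0.113.5"))
```

Requiring both signals means a spoofed user agent from an unknown address is flagged, while ordinary visitors are never tested against the IP list at all, which lowers the risk of zapping real users.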

  60. What’s the big problem if I stick to IPs rather than use the agent tag? Isn’t this article correct?

  61. One of the more entertaining posts I have seen recently.

    On a more serious note. I think getting bots to respond to parameters defined either by Meta information or via robots.txt has now long been a moot point, whether it be Google or any of the other major players.

    Matt, might it be worth using your extensive expertise and doing an authoritative post on the subject, particularly from a Big G perspective?

  62. thanks for the useful suggestions, will use them when the situation calls for it.

  63. patil

    I am wondering how to remove a blog from Google’s cache. This (http://services.google.com:8882/urlconsole/controller) gives me just 3 options:
    # Remove pages, subdirectories or images using a robots.txt file.
    Your robots.txt file need not be in the root directory.

    # Remove a single page using meta tags.

    # Remove an outdated link.

    So do you mean I should delete my blog? What if I just want to remove one blog entry?

  64. Too bad I’m not anywhere near the US; otherwise I think the sessions would be very useful.

  65. DIO

    SEO is more of an art than a science since most people are not aware of what the exact search engine algorithm is like. Nonetheless, the tips listed on this page are useful.

  66. Matt said: “At a page level, use meta tags at the top of your html page. The noindex meta tag will keep a page from showing up in Google’s index at all. This tag is great on any page that’s confidential.”

    This doesn’t seem to be working as designed. I’m seeing pages tagged “noindex” all over Google’s search results — and ranking well. Not isolated instances either, but dozens of pages that clearly are not supposed to be indexed showing up at the top of the SERPs.

  67. Joe

    Hi Matt,

    This is a late reply, but a valid “feature request”

    My site has a high incidence of traffic from site rippers, spambots, image farming, and competitors stealing images and content. I find sites using my content and URLs in Google searches on a regular basis and report them as SPAM. I use traps, but a spider authentication process would be an easily implemented method of alleviating the problem of bad bots and rippers. A credit card processing company uses a simple method. I receive a payment notification via the company’s server posting a query to a designated URL on my site. I post (reflect if you will) that query, appended with a validation request back to their server, and they display a string of either “verified” or “invalid”. A method similar to this would simplify the process of setting trusted “users” and granting access to spider my site.

    Best of luck and keep those videos coming!!!

    Joe
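
The post-back idea in this comment (the site echoes a spider's request back to the spider's owner, who answers “verified” or “invalid”) can be simulated in memory. In the sketch below, `CrawlerOperator` stands in for the owner's validation server and the token is an assumed detail; no real search engine endpoint or API is implied.

```python
import secrets


class CrawlerOperator:
    """Stand-in for the spider owner's validation server (hypothetical)."""

    def __init__(self):
        self._issued = set()

    def make_request(self, url):
        """Issue a crawl request and remember it for later validation."""
        token = secrets.token_hex(16)
        self._issued.add((url, token))
        return {"url": url, "token": token}

    def validate(self, echoed):
        """Answer 'verified' only for requests this operator really issued."""
        key = (echoed.get("url"), echoed.get("token"))
        return "verified" if key in self._issued else "invalid"


def site_allows(request, operator):
    """Echo the request back to the operator; allow only on 'verified'."""
    return operator.validate(request) == "verified"


if __name__ == "__main__":
    operator = CrawlerOperator()
    genuine = operator.make_request("/images/photo.jpg")
    forged = {"url": "/images/photo.jpg", "token": "0" * 32}
    print(site_allows(genuine, operator))
    print(site_allows(forged, operator))
```

In a real deployment the `validate` call would be a network request to the operator's server, so each crawl request would cost an extra round trip; that trade-off is part of what the feature request is asking for.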

  68. How safe is Google URL removal?

    I have mistakenly uploaded two pages index.htm and index.php

    Google cached index.php, and Yahoo and MSN have cached index.htm

    Which URL should I remove ?

  69. Looks like Google is spidering all of the pages, no matter what’s written in robots.txt. At least that happened in my case.

  70. I have a similar case where I’ve tried to control Google by denying access to /cgi-bin/ and it still indexes all of those pages.

  71. IncrediBILL, maybe it’s better to use Agent tag. I’m not sure you should stick to IP’s…

  72. Matt, your robots.txt file is pretty open – does that result in the bots picking up duplicate content on your site from the various locations your posts are stored on your site? Great articles btw.

  73. Did something change with this during the past week or so? I have a test site that has this in the robots file:

    User-agent: *
    Disallow: /*?

    and also has a noindex meta tag on every page.

    Up until last week, pages were steadily disappearing from Google; last week, a search of

    commonword site:test.example.com

    returned only 5 results (down from more than 1000).

    Suddenly, today, it’s returning 16,000 pages!

    I’ve double-checked that the robot file and the noindex metatag on every page are still in place.

    Any idea why this would have changed suddenly? Lots of folks have test sites, and of course we don’t want Google to index all that duplicate content. I can put an htaccess password on the site, but it’d be nice not to have to.

    Thanks for the great info!
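
The `Disallow: /*?` rule in this comment relies on wildcard matching, where `*` stands for any run of characters, so the rule covers any url containing a question mark. A minimal sketch of that matching is below. It assumes the common convention of `*` as a wildcard and a trailing `$` as an end anchor, and the function names are hypothetical.

```python
import re


def rule_to_regex(pattern):
    """Translate a wildcard robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(pattern, path_and_query):
    """True if the url (path plus query string) matches the pattern from its start."""
    return rule_to_regex(pattern).match(path_and_query) is not None


if __name__ == "__main__":
    for url in ["/page.php?id=1", "/page.html"]:
        print(url, is_disallowed("/*?", url))
```

Checking a handful of test-site urls against the rule this way shows which ones the pattern is expected to cover, independent of what any particular crawler does.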

  74. This comment is directed toward the commenter above, “dave”. I agree, I would also be interested in reading a blog post from you on the “site:” command issue. I haven’t read too much into this but am always looking for more. Dave, from what I heard, the quality of the backlinks is good.
