New robots.txt tool

The Sitemaps team just added a new robots.txt tool to Sitemaps. The robots.txt file is one of the easiest things for a webmaster to get wrong. Brett Tabke’s Search Engine World has a great robots.txt tutorial and even a robots.txt validator.

Despite good info on the web, even experts can have a hard time knowing with 100% confidence what a certain robots.txt will do. When Danny Sullivan recently asked a question about prefix matching, I had to go ask the crawl team to be completely sure. Part of the problem is that mucking around with robots.txt files is pretty rare; once you get it right, you usually never have to think about the file again. Another issue is that if you get the file wrong, it can have a large impact on your site, so most people don’t mess with their robots.txt file very often. Finally, each search engine supports slightly different extra options. For example, Google permits wildcards (*) and the “Allow:” directive.

The nice thing about the robots.txt checker from the Sitemaps team is that it lets you take a robots.txt file out for a test drive and see how the real Googlebot would handle it. Want to play with wildcards to allow all files except for ‘*.gif’? Go for it. Want to experiment with upper vs. lower case? Answer: upper vs. lower case doesn’t matter for the directive names (the URL paths themselves are case-sensitive). Want to check whether hyphens matter for Google? Go wild. Answer: we’ll accept “UserAgent” or “User-Agent”, but we’ll remind you that the hyphenated version is the correct one.
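
For instance, here’s the sort of sketch you could paste in to test blocking just GIF files for Googlebot (the “$” is Google’s end-of-URL anchor, another extension beyond the basic standard):

User-Agent: Googlebot
Disallow: /*.gif$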

The best part is that you can test a robots.txt file without risking anything by doing it on your live site. For example, Google permits the “Allow:” directive, and it also permits more specific directives to override more general directives. Imagine that you wanted to disallow every bot except for Googlebot. You could test out this file:

User-Agent: *
Disallow: /

User-Agent: Googlebot
Allow: /

Then you can throw in a URL like http://www.mattcutts.com/ and a user agent like Googlebot and get back a red or green color-coded response:

robots.txt result

I like that you can test out different robots.txt files without running any risk, and I like that you can see how Google’s real bot would respond as you tweak and tune it.

105 Responses to New robots.txt tool

  1. Is there any chance Google would support Crawl-delay in the future? I think Yahoo! and MSN support this feature, but Google says it isn’t recognized. My friend’s server had a high load because of the googlebot, and I suggested crawl-delay, but now I see that doesn’t make a difference.

  2. Good morning Matt

    The folks on Google’s Sitemaps team have been introducing more and more great tools. Best of all, it’s FREE! At least for the time being πŸ™‚

    Btw, when will you write something about how Googlebot sees and reads the metatags?

    And … Looooooong time no Gadgets posts (:-(

    Have a great sunny day.

  3. You guys are on the winning track with Sitemaps. Amazing stuff that group is pulling out of their magic hats!

  4. Perfectly right: robots.txt is generally a file that you create once with the website and then don’t touch. At least that’s what I do. This seems like a nice tool for checking, but I particularly liked the example with many of the bad robots excluded.

  5. Matt, you guys at Google are great

    This robots tool is really useful; I used it a long time ago.

    Google Site Maps is helping us get better exposure on Google, and it makes it a lot easier to be crawled.

    There is also the Google Analytics tool, which really ROCKS πŸ™‚

    Useful Tools:

    Search Engine World Robots tools
    Google Site Maps
    Google Analytics

    Also, don’t forget to test the speed of your website (and optimize it): http://www.websiteoptimization.com/services/analyze/

  6. Nice! That sitemap thing has sooo much potential!

    Do you ever think we will see things like

    >Your site is linking to some urls of questionable repute..

    Or

    >Hey, you really do need to look at that ol link farm of yours..

    Or

    >We can read javascript too dude, what’s with all that var 1 var 2 unescape onload onmousemove cobblers?

    Like a site spam integrity check..hey, could even use sound files that LOL or ROTFL or a heavenly ‘laaaah’ angelic type sound for ultra white cleanliness πŸ™‚

    Ok, yup, I need to go get up and drink some coffee.

    Nice work in any case

  7. Sorry to be negative, but, imo, it’s bad to mess with the standard robots.txt directives. “Allow” isn’t part of the standard and, from what you wrote Matt, neither is the * wildcard. So somebody checks the file for Google and finds it to be ok, BUT it won’t necessarily be ok for other engines.

    It’s fine to come up with new things, like rel=nofollow, but it’s bad to mess with existing standards, especially when it can be bad for people – they may take Google’s word for it and find other engines doing different things. As you rightly said, “if you get the file wrong, it can have a large impact on your site”.

    I know that “Allow” isn’t new from the Sitemaps team, but their new checker should not accept things like that as being correct – at least not without a bold red warning.

    Personally, I’ve always wanted Allow as part of the standard, but an engine shouldn’t introduce it unilaterally, and it definitely shouldn’t tell people that it’s ok without a strong warning.

    Having said all that, I haven’t looked at the checker, so I really don’t know what it does πŸ™

  8. For the past two days, a large Dutch website (http://www.kieskeurig.nl/) seems to have been gone from Google’s index, and also from the Google Directory. They are still in all the other DMOZ mirrors.

    Question: could an incorrectly implemented robots.txt kill a Google Directory listing?

    (Both BMW and Ricoh were removed from the directory, which led me to think that that Dutch website also received a penalty. The problem is, nobody knows why; nothing seemed illegal.)

  9. Is it normal for Google spiders to take well over four months to properly index a site? I’ve been having this problem with my site (no snippet info) and your Google support team isn’t much help.

    I don’t have a robots.txt file in my directory, so it’s not possible for me to “get the file wrong.” Sorry no kissing up here.

  10. Aggregated news sites are duplicating our content, so who gets credit for new posts in our blogs? Us or the people copying us?

    Us or the people copying us?
    Us or the people copying us?
    Us or the people copying us?

    PhilC – That last sentence is damn funny dude! πŸ˜‰

    and I know, I am an arse, deal with it!

  11. Hi Matt,

    Thank you for your explanation.

    BTW (off topic)… why are you using WordPress instead of Blogger? πŸ™‚

    The world has some things that we’ll never ……………..

  12. As Martha would say – this is a good thing.
    I am new to Sitemaps and love the new tools (highest PageRanked page by month, etc.).

    My one wish for Sitemaps is to integrate with your team, Matt.
    So a webmaster or site owner (more importantly) can see if there is a ban on the URL, the reason, and the length of the ban (or the next recache).

    This would be very nice for the owners of sites who pay people to manage the SEO to make sure they aren’t screwing things up.

    Thanks!

  13. Question:

    My Robots file specifically excludes:

    /blah*

    When I check in the new robots tool, Google tells me that indeed, my robots.txt does exclude this URL format from being indexed.

    Then, when I go to ‘Error Stats’ I see many many lines of:

    /blah?query_string – “General HTTP Error”

    entries, all from a couple of days ago, even though my robots.txt has been in place for at least 8 months.

    Er… huh? Why are they there when Big G tells me it knows it’s not meant to index them????

    Something wrong with GBot, methinks.

    Any clues Matt?

  14. Good questions!

    Andrew Hitchcock, I asked the crawl team about this a while ago, and there’s a good reason. It turns out that a lot of webmasters give crawl-delay values that are way out of whack, in the sense that we’d only be able to crawl 15-20 urls from a site in an entire day. I’ll try to post more details about that sometime in the future. The crawl guys are interested in allowing people to give some sort of hostload hint, but it’s their opinion that crawl-delay isn’t the best way to do it.

    Harith, I can do a metatags post sometime. I know I need to work some gadget posts in too. πŸ™‚ The Sitemaps team seems intent on giving webmasters better and better tools to debug/diagnose issues, so I’d be surprised if we ever charged for those tools–helping webmasters with their sites helps Google too, after all, because we’re more likely to be able to crawl better content.

    rob, in my ideal world, Sitemaps would be a good place to alert webmasters of potential problems, including potential spam problems. For now there’s still plenty of regular stuff to tackle, but we’ve talked about how neat it would be to do some stuff like this.

    PhilC, I take your point. The problem is two-fold: 1) robots.txt is sort of an informal standard; I don’t think that there’s an RFC, as far as I know, and 2) lots of people go ahead and put things in robots.txt. If you see 5 million people use “Allow” even though that’s not mentioned anywhere in the standard docs, and you can tell what the people mean by Allow, it can be good to support that. But I agree that having a standard is useful; maybe there’s a way to get a bunch of folks together and talk about adding to the spec. I think we support everything other engines do (except for crawl-delay, for the reasons I mentioned above).

    benj, my first answer is “go into sitemaps and see when we last fetched your robots.txt file.” That info is in Sitemaps and should help you know when to expect us to have processed it. But my second answer is “you don’t need to exclude ‘/blah*’ because you can just exclude ‘/blah’ and search engines will do prefix matching.” That’s exactly the question that Danny Sullivan asked, so I know the answer in a very fresh sense. πŸ™‚
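
    To make the prefix matching concrete, a quick sketch: a rule like

    Disallow: /blah

    blocks /blah, /blah.html, and /blah/anything, because the value is treated as a prefix of the URL path.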

  15. I used this tool yesterday for the first time. Just saw it there so I thought I’d try it out. Very nice work. Even found a site I had with no robots.txt file. I would have never gone back to check something like that on my own so this was great. Keep up the good work. I love all the tools coming through. It’s almost like gaining your sight after years of fumbling around in the dark. Thank the whole team at Google for these webmaster tools.

  16. I believe that the robots.txt “evaluator” is a great idea. I am pretty certain that a lot of people choose not to touch their robots.txt file due to uncertainty about how a change could negatively impact their rankings.

    It may be worthwhile to integrate the robots.txt tool into Google’s automatic URL removal system, or even make viewing the analysis and confirming the result a mandatory step before proceeding with the automatic removal (i.e. “this is what we will remove, are you sure you want it?” =)

    Another suggestion — it appears that the automatic removal system does not presently honor wildcards in the robots.txt file.

  17. Disallow: /test.html

    test.html – Blocked
    Test.html – Allowed

    It seems to allow everything when there is an uppercase letter.

  18. Neat tool – thanks again google.

    So say you have a site that is 100% search engine crawlable and has links from other indexed sites, so you know finding your site won’t be a problem. Also, you don’t want to restrict any user agents from crawling any files or directories – is there a point to having a robots.txt?

    I guess the question is how important is a robots.txt file to the everyday webmaster?

  19. Oh Man!! Suuhhhweeeet! I’m going to be playing with this nifty little tool for the next few hours..xxx thx goooogle!!!

  20. I build search-related tools for business intelligence measurement and it would be great to have an API to pull the Sitemaps data into my tools, to have the info all in one interface for the end users (site owners). Any plans for something like that?

  21. Michael,

    Some Web Analytics tools are able to identify unknown robot activity based on downloads of robots.txt. They might not track unsuccessful robots.txt requests.

  22. [quote]benj, my first answer is β€œgo into sitemaps and see when we last fetched your robots.txt file.”[/quote]

    Matt – that would be today! This means it’s impossible for me to know when the previous time was, unfortunately. I’m sure it was less than 8 months ago though!

    Ah well. Will keep an eye on it and see what happens.

    Thanks as always for your feedback to us ever-hungry GWatchers!

  23. Good point Sean Carlos – I didn’t think about that. Thanks. πŸ™‚

  24. On the HTTP errors page, it would help out tremendously if there was a link to the page where the error was found. Lots of good reading here, thanks all!

  25. Michael, I’ve also noticed that I’ve achieved much better MSN rankings simply by having a blank robots.txt file there than not having it there.

    Unexplainable, but MSN seems to prefer you at least have a blank file.

  26. Me too, Ryan, I thought it was just me!

  27. Thanks Matt. I’d also like to thank the Sitemaps team. There are a lot of neat things coming out of the whole webmaster tools area.

    One odd thing: I just checked now, and my highest PageRank page for January just changed retroactively. I thought that was odd.

  28. I tried commenting 3 times with Firefox 1.5 and it said “invalid security code”, so this is a test in IE.

    Failed in IE, now trying case-sensitive.

  29. The error check tool for robots.txt gave me an error “over the 2000 character limit”. I can’t find any documentation of that anywhere in Google Sitemaps or on the web.

    I really like your blog, Matt, and Google Sitemaps is a great idea.

    That Security Code thingy here should warn that it’s case sensitive as those things rarely are.

  30. Maybe I’m missing something, but what’s the point of “allow” in a robots.txt file when spiders, by default, crawl all pages?

  31. You’d be surprised how often sites do not have a robots.txt, let alone have it correct.

  32. “It’s almost like gaining your sight after years of fumbling around in the dark. Thank the whole team at Google for these webmaster tools.” Will do, Nick.

    Michael Weir, in a situation like that, I wouldn’t worry about having a robots.txt. For a while, if Googlebot requested a robots.txt and got a Forbidden error, we wouldn’t crawl the site. But so many people had that problem that we changed it. So a 404 or a catchall page won’t cause problems for Googlebot.

    Ryan, interesting suggestion. I wonder if that’s true.

    Dave, you can do things like my example (exclude everybody, then allow only a specific bot). You could also use “Disallow: ” instead of “Allow: /” but lots of people get nervous about “Disallow: ” since it’s only one character away from completely blocking a site with “Disallow: /”
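
    In other words, a sketch like this should behave the same way for Googlebot as the Allow example in the post, since an empty Disallow value means “nothing is disallowed”:

    User-Agent: *
    Disallow: /

    User-Agent: Googlebot
    Disallow: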

  33. Hello Matt,

    I want to know something. Some time ago our sites were removed from Google through a robots.txt removal: someone put in a removal request through the Google URL removal console, and a bunch of our sites, which were performing well, got removed by Google. We couldn’t figure out who did that, but we suffered a great loss because of it.

    Now, we have submitted reinclusion requests many times but with no result. Is there any way to get those sites reincluded? Or any idea when those sites would come back into Google? If you want, I can provide you with URLs for reference.

    It would be great if you could reply to this.

    Regards
    WP

  34. Good morning Matt

    This is to say a SPECIAL thanks for taking the time to reply to comments. I know that it must take a bunch of time and effort on your side. But I can assure you that the webmaster community highly appreciates such “interaction”.

    Wish you a great day.

  35. I thought you might be interested in a very obvious case of spamming…
    http://www.repairbuilding.com/

    Oh, by the way, did you know that the stuff you wrote about bmw.de was featured in the London Metro?

  36. >rob, in my ideal world, Sitemaps would be a good place to alert webmasters of potential problem, including potential spam problems.

    You know, when I mentioned that earlier it was a light-hearted look at what could be useful too; I’m glad you guys are actually throwing the idea around. When you think about it further, it really would be a cool thing for site owners to know about.

    Some site owners and webmasters just aren’t that tech savvy, especially with regard to SEO. They employ an SEO on trust, to deliver results and improve their rankings (that’s the aim). They have little concept of redirects and cloaking and keyword density, tag structure, off-site factors, algo papers, multiple domains, subdomains, link building, link spamming, hidden divs and the whole list of techniques that can be and are so often easily abused. If the SEO firm does stuff that messes things up for them, in some cases I’d imagine that the owner is simply unaware until it’s all too late! By which time they are left in a “what do I do now” situation, having to either employ another SEO consultant for a considered view, or rely on the BS of the company they’d employed. A tool that said look at this or that, with links to a glossary of terms describing the potential problem, might be useful.

    If the SEO community were aware that the Sitemaps tool did indeed identify and display areas of concern, then I’d imagine that some might think twice about using them. I appreciate too that some might just use this knowledge to their competitive advantage and go tweak tweak tweaking! But hey! What a cool challenge for the development team nonetheless!

    I’m sure you chaps could spare a buck or two on development πŸ™‚

  37. Interesting blog. Slightly off topic, however: I would like people’s opinions, especially Matt’s, on article marketing. Is writing articles a good way of promoting a website and obtaining backlinks, or do the search engines see it as spam?

    Thanks

  38. Rob, I’d imagine that making a tool like that would be rather complicated. I say this because I spent the last 10 minutes thinking of how I’d code it.

    Sure, it’s relatively easy to identify some techniques… but I can think of a counter-example for each technique that is valid.

    Example: having a hidden div. This can be useful; on dotcult.com I have a bunch of hidden divs… but when you click a little question mark icon, they show up as a help box. This is a valid use of a hidden div… The same goes for the Amazon-owned technique of showing more form fields based on what you’ve already filled out.

    What about javascript redirects based on browsers? Sure, with some web standards, CSS, and ECMAScript it’s not needed.. but many webmasters just don’t know that. A redirect that takes me to 2 similar pages shouldn’t be dirty.

    Hidden links? I use those too. On some of my pages, I used to have a hidden link to a file called emails.php. A user will never see it, but an email-gathering script will… and when it did, it got 100,000 randomly generated fake emails, and a few links to the same script on other sites. I don’t do it anymore, but I also don’t consider it search engine spam (my robots.txt forbade Googlebot from seeing that page).

    White text on white background? Sure, no problem… If I also have an image background defined in my .css file.. then it’s visible again..

    Many of these things can only be determined by looking at the whole picture, or adding a human perspective. (especially in the dotcult case where clicking a link 1/2 way down the page makes the hidden div visible again)

    I think an algorithmic program like this would actually ban / penalize some sites that didn’t deserve it.

    I’m not sure of Google’s viewpoint, but if I were running it, I’d rather let in 10 spam results and ban no good sites than ban 1 good site to keep out 10 spam results…. sorta like how our justice system is against sending innocent men to jail (gotta think pre-Patriot Act here).

  39. I have the same “problem” as benj: a page blocked by robots.txt is listed in errors with “General HTTP error”.

  40. Hi Matt,

    As long as we are discussing the robots.txt file, I thought I’d raise this question. And I apologize if it has already been addressed.

    To keep a webpage out of Google’s cache, a webmaster could use the ROBOTS NOARCHIVE tag.
    So, is Google treating the ROBOTS NOARCHIVE tag as a suspicious sign that the website is engaging in cloaking? Will Sitemaps be able to head this problem off at the pass, acting as another early warning indicator of possible cloaking? And is that how BMW was identified?

    Disclaimer: I’m not a techie, just a marketer.

  41. Matt said: Want to experiment with upper vs. lower case? Answer: upper vs. lower case doesn’t matter.

    Yes, it does. It’s case sensitive – at least in my tests, based on a Unix platform.

    If Googlebot was smart enough to determine whether or not the platform itself was case sensitive, and treat the URLs in robots.txt accordingly, I’d be impressed (but not surprised). πŸ™‚

    The correct use is case sensitive, BTW.

  42. Hi Matt,

    A lot of what you are explaining goes right over my head, as I learnt to design my website by myself from Dreamweaver tutorials. I do try to follow everything on SEO, but it is very confusing for the average person (am I a webmaster? πŸ™‚ ). My website has been running for 3-4 years now.

    I tried the robots.txt validation tool and got these worrying results but don’t know what having a 404 message means.

    Any ideas what I should do about these results?

    Robots.txt Validator
    http://www.searchengineworld.com/cgi-bin/robotcheck.cgi

    http status: 200 OK

    Syntax check robots.txt on http://www.turkeyvillarental.com (660 bytes)
    Line Severity Code
    1 ERROR Invalid line:

    2 ERROR Invalid line:

    3 ERROR Invalid fieldname:

    4 ERROR Invalid line:
    turkeyvillarental.com
    5 ERROR Invalid line:

    6 ERROR Invalid fieldname:

    7 ERROR Invalid line:

    8 ERROR Invalid line:

    9 ERROR Invalid line:

    10 ERROR Invalid fieldname:
    The website for turkeyvillarental.com can be found by clicking here.
    11 ERROR Invalid fieldname:
    turkeyvillarental.com is registered through Easily.co.uk – get web site hosting or domain name registration here
    12 ERROR Invalid line:

    13 ERROR Invalid line:

    13 ERROR There should be atleast 1 disallow line in any Robots.txt.

    We’re sorry, this robots.txt does NOT validate.
    Warnings Detected: 3
    Errors Detected: 14
    3 warning Field names of robots.txt maybe case insensitive, but do capitalize field names to account for challenged robots.

    6 warning Field names of robots.txt maybe case insensitive, but do capitalize field names to account for challenged robots.

    11 warning Field names of robots.txt maybe case insensitive, but do capitalize field names to account for challenged robots.
    turkeyvillarental.com is registered through Easily.co.uk – get web site hosting or domain name registration here

    robots.txt source code for http://www.turkeyvillarental.com
    Line Code
    1
    2
    3
    4 turkeyvillarental.com
    5
    6
    7
    8
    9
    10 The website for turkeyvillarental.com can be found by clicking here.
    11 turkeyvillarental.com is registered through Easily.co.uk – get web site hosting or domain name registration here
    12
    13


  43. Ryan, great points, and yes, I accept it wouldn’t be an easy task by any means. I don’t think it’s fair or possible to ban domains for the very reasons you’ve shown. That said, it is at least feasible for a system of sorts to flag it for the attention of the site owner.

    If they are aware of it then they can at least broach the issue with their SEO or webmaster. 99% of this stuff is perfectly OK of course, and a rational explanation about, say, a particular mouseover effect to show a layer or trigger an event would be perfectly reasonable, and very very rarely spam.

    It’s more a question of intent. And let’s say, for example’s sake, that something is flagged and the site owner chooses not to act on it, or at least question why it’s used; then IMO he is asking for any grief he gets further down the road.

    Hell, let’s not understate this: it needs some massive thought and debate around how and why and what it would do, but.. I think it could be worth it, simply by enabling people to know that little bit more about issues that could affect their livelihoods.

    It all boils down to trust really. We trust our doctors with our health, but are also free to get a second opinion. In most cases we don’t, either because the cost is too prohibitive or because we just choose to trust their opinion/diagnosis. Weak analogy perhaps, but if Sitemaps could offer that little bit of a pointer where people would otherwise be ignorant, then I think on balance it could be of value.

  44. Matt, I’m curious to know your thoughts on links that add extra strings. For example, this is a normal link to one of my articles:
    http://www.familyresource.com/pregnancy/33/1054/

    However, I have a send to friend feature that emails the article link, and tacks on “?via=email,” so the link looks like this in the email message and on the browser if they click on it and visit the page:
    http://www.familyresource.com/pregnancy/33/1054/?via=email

    I also do something similar for my RSS feed (?via=feed). This of course helps me know how people are getting to the page (via search engine, feed, or email).

    Do I need to be doing any sort of redirect? Will Google see these links as duplicate content? Or is everything fine the way it is?

  45. When I went into Google Sitemaps and checked the robots.txt standard option, it said the maximum is 2000 characters. Is this true, and does it matter if it’s more than 2000 characters? I have a problem with the wildcard option so I had to make the file longer. Does Google care if it’s more than 2000 characters?

    Matt, you have done a fantastic job with the blog. I love reading it and reading other knowledgeable readers’ comments.

  46. I see it doing more harm than good.

    It would open the floodgates of emails to Google: “why am I flagged for this”, “I’m suing, you called my page spam”.. “you didn’t warn me but I was still banned”, etc. etc.

    It would also let SEOs game the system.

    In essence, it’d be saying “ok, here’s what we know about.. if we don’t catch it, go ahead and do it”.

    While it would be useful to us, I just can’t see it being in Google’s best interests.

    However, the detection does seem like an interesting project.. I may just try to hack something similar together just to entertain myself.

  47. Having Google Sitemaps tell us where we might have messed up is a good idea. Sometimes we just don’t know what to fix; we’re as honest as we can be with the site, and we still have a PR of 0.

    What can we do to make Google like us?

    If Google Sitemaps (or you) could tell us that, that’d really help.

    Thanks,
    Aubrey

  48. Matt,
    can you please double-check on the case sensitivity? I have changed my file names to lowercase, removed the old ones via Google remove, and blocked robots from accessing them so they don’t appear as supplementals 180 days later. If I block, let’s say, /Shoes/, will Google index domain.com/shoes/ ?

    thanks

  49. Matt,

    Maybe you can have the folks who built this fix the little quirk with the URL Removal Tool next?

    Reference:
    http://forums.searchenginewatch.com/showthread.php?t=9618

  50. Sticking to standards is one of the most important things in engineering and, I think, in webmastering in particular. However, it always takes somebody to drive the standard. And in this case it may be the Google Sitemaps team.

    BTW, is the site map already that useful?

  51. When is the next toolbar PR Update?

  52. The robots.txt feature appears to let webmasters check against other Google robots e.g. Googlebot-Image
    BUT if you try:
    User-agent: Googlebot-Image
    Disallow: /
    It says Googlebot-Image is allowed.
    I know it’s just as easy to put all your images in one disallowed folder, but this way would appear to save Googlebot-Image wasting bandwidth crawling the site.

  53. Sorry if this is a little off-topic, but it is still on the subject of giving bots hints over what they should crawl. We’re encouraged to use rel=”nofollow” where the link may be spam. Is the anchor text still read by Google?

    For example, your blog will insert rel=”nofollow” into this link: brilliant blog, but might Google still direct someone searching for the phrase β€˜brilliant blog’ to this page?

  54. I finally had a look at the new tool. I didn’t try it out but I had a look at it – I’ve done my duty πŸ™‚

    While I was in Sitemaps, I noticed two lists of words under Stats -> Page analysis

    Are these lists ranked in any way? Are they the top most commonly used words in the site and links, or are they just a miscellaneous selection?

    If they are merely a miscellaneous selection, is there any point in showing them? Are they useful to us in any way?

  55. Matt, I found a problem with the tool, documented on my blog (http://michaelteper.com/archive/2006/02/10/5837.aspx). I’d be interested to know whether you are able to reproduce it at the plex.

  56. Is there a problem if I do not use Robots.txt?

  57. No. The robots.txt protocol is for disallowing spidering, so not disallowing anything (not having a robots.txt file) is fine.

  58. Not having a robots.txt could result in a huge error log, which is why I would use an empty robots.txt instead.

  59. The error log is a bit of a red herring, since *having* a robots.txt file means you just get hits in your access log instead of your error log. Plus, of course, your server has to deliver the file.

    The real difference between not having a robots.txt file and having a blank robots.txt file is this: no robots.txt file means you are implicitly giving robots permission to crawl your site (based on their assumptions); a blank robots.txt file means you are explicitly giving robots permission to crawl your site. If the actions of robots are ever questioned legally, that could be a big difference.
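
    For reference, an explicit “allow everything” robots.txt is just a sketch like this, where the empty Disallow value blocks nothing:

    User-Agent: *
    Disallow: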

  60. Need a robots.txt trick. We have a JavaScript snippet from our tracking provider (Webtrekk) on every page. It sends tracking data to their servers. The JavaScript contains the following section:

    var wt_dm="track.webtrekk.de"; // track url
    var wt_ci="88880000865111"; // webtrekk id

    var wt_be="some.content"; // dynamic content id
    // more stuff

    Now, in our server logfiles, I see Googlebot requesting invalid pages like:
    /path/page.asp/track.webtrekk.de
    /path2/page2.asp/track.webtrekk.de

    These will always be 404s. How do I tell Googlebot, in my robots.txt, to ignore any page that ends with track.webtrekk.de?
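
    One pattern that might do it, assuming Googlebot’s wildcard support and the “$” end-of-URL anchor apply here, would be something like:

    User-Agent: Googlebot
    Disallow: /*track.webtrekk.de$

    though I’d run it through the robots.txt checker first.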

  61. Perhaps, with regards to crawl-delay, you could support it up to a certain value (e.g. once per minute), and treat values over that as this maximum value?

  62. (sorry for this being two posts)

    I guess 1 page a minute is still 1440 pages a day, so perhaps for larger sites that’s not so suitable, and still ‘out of whack’.

    Perhaps you could weight the max crawl-delay value such that it depends on the number of pages indexed on a site. Perhaps up to a value such that you can index 1/10 of the entire site in a day, or the whole site in a week (I’m sure you could come up with a reasonable value).

    Or set it such that it doesn’t occur for unindexed pages (so that new content gets indexed fast).

    At the very least, I’d think a crawl-delay value of up to about 2-10 seconds should be respected.

  63. Hi Matt.
    I like the sitemap tool.
    The problem is that my site is still not verified. And yes, I do have the HTML verification file with the name Google specified.

    Another question.
    Let’s say I have a virtual supermall. If every major category has like 15,000 products, do you recommend creating subdomains for each category, or just subfolders on the main site?

    Thank you.
    Pex

  64. So Matt, can you explain why Googlebot disregards the robots.txt commands when someone links to a no-indexed page/section? In addition, can you explain why Google then penalizes a site for the no-index content contained on those pages? (More specifically, thousands of blank pages – set up since the sections had been created, but not ready for indexing – in 7 brand new sections of an existing site.)

    i.e. – /New Directory Section Awaiting Content Links and Descriptions/

    Has this issue been addressed by Google or is it still going on? Has Google taken up the policy of indexing entire sites, including these no-index sections, looking for spam or affiliate content in an effort to apply these factors to a trust/rank algorithmic analysis?

    The response I got from Google was that the fact that links pointed to the main page of each directory justified Google disregarding the robots commands and indexing thousands of “template” pages within those folders, even though every page contained the noindex, nofollow tag and the folders were disallowed in the robots.txt file.

    I’d like a bit more clarity on this, since you seem to be the Google man now and this issue devastated top Google rankings for two months before Google removed the pages.

  65. Hi Matt,

    Thank you for your explanation.

    btw small question

    Disallow: /*?

    http://www.searchengineworld.com/cgi-bin/robotcheck.cgi result

    Possible Missplaced Wildcard. Although Google supports wildcards in the Disallow field, it is nonstandard.

  66. The robots.txt validator looks like a great tool, but it looks like most of my home page came up as not valid. Now a really super tool would also show those of us with no HTML knowledge how to fix it. πŸ™‚

  67. It’s March 2006 and I added a robots.txt to exclude Googlebot from visiting pages that are 302 redirects to off-site pages. Googlebot quit crawling them around the time I added the robots.txt file, August 2005; however, the pages still remain in Google’s index to this day and are cached with August dates. The robots.txt seems to make Googlebot quit crawling a site, but not remove it from the index! Comments?

  68. How can a website owner get it right when Google thinks they own everything and they can’t even follow the IETF and W3C guidelines? What gives you the right to change standards without going through the proper channels? Besides, even when we do follow the guidelines, Googlebot, the hack, breaks the rules and indexes things it isn’t even supposed to stick its nose in.

    Get your stuff fixed first before you even attempt to establish guidelines.

  69. PatMore – Forget about robots. Get a new XHTML 1.1 site built without frames using static HTML pages (non-dynamic). Robots content follow is all you need, plus the little methods I use; you should be TOP TOP TOP after 6 months, not 3-4 years!!! In fact for your keywords, top in 3 months should be a piece of pith.
    Say hi to Mrs Patterson for me.
    xx

  70. Great blog by the way; robots are great fun to play with.

  71. Hey Matt, I have seen on other message boards (webmasterworld.com for one) an error coming up in the Google Sitemaps HTTP errors section.
    I am sure it is being generated by the bot, as all the sites are different: some are nothing but PHP, mine is nothing but .htm with no database, yet we get URLs ending like this:
    .com/eoifcbfxfc.html
    .com/lxcunhvf.html
    .com/qbstjxyvx.html
    I don’t have it on my main site but I do on my subdomain
    http://www.webmasterworld.com/forum30/34272.htm

    If possible, send this to the Sitemaps team to research, OK?
    Thanks

  72. Hi Matt,

    I have designed a website, ” rajasthantourindia . com “. When I launched this website I used no-cache in the meta tags, and I saw that google.com had indexed around 250 pages.

    After the completion of the website I removed that no-cache meta tag from all the pages; I did this around 1 month back.

    I have checked in the Google cache ( http://66.102.7.104/search?q=cache:dWLR8bU2cFsJ:www.rajasthantourindia.com/+site:rajasthantourindia.com&hl=en&ct=clnk&cd=1 ) that the bot visited my page on 18 May 2006 15:46:30 GMT, but it’s indexing only 1 page.

    Since deleting the no-cache meta tags from all the pages, Googlebot has visited this website around 3 times, but the rest of my pages are not getting indexed. I have no clue why this is happening to my website.

    Hope to get your valuable answer / suggestion for my problem.

  73. Hi!

    I removed my site from Google using the URL removal system.
    I removed only 3 subdomains and the main page, but I changed robots.txt for all subdomains (12 subdomains). My site and all subdomains are not in Google.

    Now I want to add them again in Google.

    For the subdomains removed with the removal system, I know I need to wait 6 months.

    How much time do I need to wait for the other subdomains?

    Thanks!

    Sorry for my english!

  74. I recently moved a site and disallowed all robots until I got the structure and the links as they were on the old site. It was only for a couple of days. Although it said the sitemap was downloaded only 20 minutes previously, it was using a cached version of my robots.txt file and wouldn’t accept my new sitemap because it said it was disallowed. I tried the tool thinking I could update the records with my new robots.txt file. It didn’t. I ran the tool over again hoping it would recognize the new sitemap. It didn’t. If you click on the robots.txt link at the top of the page, you get realtime results, but you can’t use the tool to get realtime results on your actual sitemap. πŸ™ I think that would make this tool unique and different from the others on the web that do the same thing.

  75. Matt,

    Robots.txt should always be dynamic based on the user agent, and only show this to the rest of the world:

    User-Agent: *
    Disallow: /

    The reason is, once you show all the bad bots what user agents you allow, they simply cloak as those bot names and slide right through your .htaccess filtering for bad bots like it didn’t exist. Show them nothing and they have to play hit ’n’ miss with user agent names and expose themselves.

    BTW, why doesn’t Google post the exact IP ranges used by Googlebot so we can effectively lock out spoofers without risking locking out Google?

    To stop spoofers and hijacking via proxies that cloak directories to Google (if you want examples, let me know), I only allow Googlebot based on a known IP range for Googlebot OR if NSLOOKUP or WHOIS says the IP is owned by Google. Anything outside of that claiming to be Googlebot gets slapped in the face, but an official IP list of the crawlers would make life MUCH easier.

    FYI, I’ll be speaking at the Bot Obedience session πŸ˜‰

  76. I like the Google Sitemaps tool, but I’m having a slight issue with it. For my site in particular, the robots.txt file is larger (~3.5MB) than the Sitemaps tool will allow (I think it’s a 5000 character limit)… so the last line that it shows me in the text box is

    Dissalo

    and it’s giving me an error for that…

    so my question to you robots.txt gurus is a simple one:

    Is there a file size limit for a robots.txt file??

    thanks for the great wealth of info on this site.

  77. Robots.txt had no role in my case. Google spidered all of the pages. Looks like I’ll have to use the removal tool. πŸ™

  78. Just as a note:

    robots.txt tutorial and
    robots.txt validator

    are dead links already..

    Cheers
    Andreas

  79. Does anyone know whether META tags or robots.txt work better?

  80. ZZPrices:

    You can block access to complete directories or types of files using robots.txt. With the meta tag on each page you can instruct a spider what to do: follow the links or not, cache the page or not, index it or not.

    So, if you don’t want your pages to be cached by search engines, you need to do that with the meta tags. To keep spiders out of, say, your images directory, you need the robots.txt.

    Different needs, different methods.
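
    For example, a rough sketch of the two approaches (the /images/ path is just a placeholder): a robots.txt rule to keep spiders out of an images directory,

    User-Agent: *
    Disallow: /images/

    versus a per-page meta tag to keep an indexed page out of the cache:

    <meta name="robots" content="noarchive">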

  81. Matt: I came across a robots.txt file which looks like this:

    User-agent: Googlebot-Image/1.0
    Sitemap: http://www.XXXX.com/Google_SiteMapIndex.xml

    Is this valid syntax, or a documented or undocumented feature?

  82. Happy New Year Folks,
    I am an SEOer turned newbie webmaster/designer, hanging on by my fingernails to the ‘leading edge of change’! This discussion on the robots files has me totally freaked out… so will the Googlebot squash me like a useless bug if I do not put that robots.txt file on my sites, for now anyway? I mean, all my sites are either just newborns or else undergoing heavy-duty total redesigns et al. I could say ‘knowingly’ that it is premature, but I don’t know :)) and frankly, I am scared to death to even go there, notwithstanding all the great tools the Google team has provided.

    Also, I don’t want to do sitemaps either ’til I am done with all the changes in content and architecture…. cause I think Googlebot will say ‘grooooan, another woman changin her mind a million times a week’ and kick my Googlebot butt! :))

    In ALL SERIOUSNESS, while many of you are ‘old hats in white’ in terms of experience, I am not, and the tools Google is offering webmasters RULE!
    I would be lost without them. So a huge thanx to the creative forces that bring these forward.

    I know you’re busy, but a simple yes or no to my two questions would be appreciated. No need to publish this; a simple e-mail would do. I deliberately did not include my sites as this is not an attempt to draw traffic or undue notice to them.

    All the best and thanx
    Linda

  83. What happens when you don’t have a robots.txt? I have noticed that Google doesn’t like it when you don’t add a robots.txt to your website. The first time I noticed it was when a client removed the robots.txt from their website (all directories should be crawled). But then Googlebot started to look for the robots.txt and stopped when it could not find it, or at least that’s what it looked like. We also helped another client with their Google Mini installation. They use the Mini on their website and on their intranet. On their intranet they did not have a robots.txt, and the Mini did not crawl it until I put up a robots.txt that said allow.

    How is it with robots.txt? Must you have it even if you want Google to index everything?

  84. Does the rel=”nofollow” tag not harm the internet link pool?
    Captchas are excellent, and robots.txt files do their work as well.

    regards

    Frank

  85. I have searched the net to find a way to block one of two domains pointing to the same site with robots.txt, but haven’t found any useful info.

    I’ve got a domain name with the .se suffix and also a domain name with the .nu suffix, and both of them are pointing to the same site. I wish to block the .nu totally to avoid duplicate content. Is that possible to do with the robots.txt file?

    And if not.. any suggestions of what to do instead? My web host isn’t doing much; I have asked them to make a 301 redirect from .nu to .se but they don’t do anything. It’s also on a Windows server so I can’t use an .htaccess file.

    Google tools are great and free, as said before, although I think some are a little bit hard to use. It would be great to see some examples of how to use Google Analytics more properly and to see what one really can do with that too.

    “Brett Tabke’s Search Engine World has a great robots.txt tutorial and even a robots.txt validator.” Can’t find any of them on the site.. Are they still there??

  86. the validator link is not working πŸ™ The requested URL /cgi-bin/robotcheck.cgi was not found on this server.

  87. Is it true that since the beginning of June the lack of a robots file causes Googlebot to stop crawling? My site didn’t have one and returned a 404, and suddenly none of my pages can be crawled now (I used to have over 10,000 and daily crawling).

    I fixed that ten days ago, but the tool still displays the (lack of) file cached at the beginning of June (though the link on the same page points to my new file).

    So what is the frequency of refreshing this robots file?

  88. Automation is a little scary unless it gives a full readout of the changes it makes so we can backtrack if needed.

  89. Is the bot a free download?

  90. You’ve got broken links in the top paragraph. πŸ™‚

  91. Hi, wonder if my question can be answered, it’s causing me some trouble!

    We have an internet forum, and the Google robot spends a load of its time crawling our calendar, which is no use to us, and we need it to stop doing that. We have disallowed the calendar in the robots.txt file but the robot still crawls it. I assume the file is incorrect, but from all my research, it looks correct.

    It seems this is a common problem in the internet forum world and nobody seems to get any success, is there a known solution to this using the text file?

    Many Thanks.

  92. How do I tell Googlebot to ignore trackback and feed links in my robots.txt?
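
    Assuming WordPress-style permalinks where trackbacks and feeds live under /trackback/ and /feed/ (a guess on my part), a sketch like this might work with Google’s wildcard support; the exact paths depend on your setup, so test it in the robots.txt tool first:

    User-agent: Googlebot
    Disallow: /*/trackback
    Disallow: /*/feed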

  93. Where can you find the specific commands for the robots.txt file?

  94. Hi,

    When Googlebot crawled my URLs, it found 112 URLs that were restricted by robots.txt.

    Do you know how to fix it? Because I want no restrictions; I want Google to be able to crawl all my files.

    Thank you.

  95. Stevie,

    Perhaps try changing your robots.txt file to:
    User-agent: *
    Disallow: /forum/calendar_week.asp
    Disallow: /forum/calendar.asp
    Disallow: /forum/members.asp

  96. Seems searchengineworld have pulled their tools
    πŸ™

  97. As Alan Perkins and others mention, the URL patterns ARE case-sensitive.

    http://www.google.com/support/webmasters/bin/answer.py?answer=40360&query=robots.txt&topic=&type=

    Perhaps you could edit your post ever so slightly, as I trusted you more than the commenters and spent some time tracking this down: your blog post is ranked more highly than the official Google answer (above) for a “robots.txt case-sensitive” Google search!

    Thanks!

  98. I know that you don’t want to restrict any user agents from crawling any files or directories – but is there a point to having a robots.txt?

  99. Hi Matt..

    Am I looking in the right place?
    The link at the top of the page is broken..

    http://www.searchengineworld.com/cgi-bin/robotcheck.cgi

    Goes to a dead page..

  100. Hi Matt, I would like to block everything around that URL, “archive.womansday.com”.
    If I do that, will it also block the main URL, womansday.com?
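
    (My understanding is that robots.txt applies per hostname, so a file served at archive.womansday.com/robots.txt with a sketch like

    User-Agent: *
    Disallow: /

    would only affect the archive subdomain, not womansday.com. But I’d like to confirm that.)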

  101. It seems this is a common problem in the internet forum world and nobody seems to get any success, is there a known solution to this using the text file?

  102. Thanks for this great site. I will look forward to reading from you later.

  103. Hello,

    Thanks for providing such great information about robots.txt.

    I also block pages through robots.txt, and I added this code:
    ——————
    User-agent: *
    Dissallow : /*LinkClick.aspx?link=
    ————————-
    The URLs come in this pattern:
    http://www.alert-ims.com/LinkClick.aspx?link=103&tabid=127
    http://www.alert-ims.com/LinkClick.aspx?link=105&tabid=127

    I want to block all URLs which come in this pattern.
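
    A corrected sketch of that rule, for comparison: the directive needs to be spelled “Disallow:” (the extra “s” and the space before the colon may keep crawlers from recognizing it):

    User-agent: *
    Disallow: /*LinkClick.aspx?link=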

  104. Useful tool; however, the syntax for robots.txt is rather simple.

  105. I realize this is an old thread, but I had questions regarding the robots.txt file and this page came up as a first result. There’s a wealth of information in this thread that has definitely answered my question. Thanks for the help and continued efforts to help us webmasters.
