Robots.txt analysis tool

This is just a reminder that if you see a problem with your site, one of the first places you may want to look is our webmaster console. In some cases, Google can alert site owners in the webmaster console if we see an issue for things like hidden text. In a case that I just saw yesterday, the robots.txt analysis tool in the webmaster console was a huge help in solving a problem. Here’s an example of debugging a robots.txt issue.

Someone was asking about a particular result in our search results. The result didn’t show a description, and the “Cached” link was missing too. Often when I see that happen, it’s because the page wasn’t crawled, so the first thing I check is the robots.txt file. Loading that in the browser showed me a file that looked like this:

# robots.txt for http://www.example.com

User-agent: *

User-agent: Wget
Disallow: /
...

At first glance, the robots.txt file looked okay, but I did notice one strange thing. Normally robots.txt files have pairs of “User-Agent:” and “Disallow:” lines, e.g.

User-agent: Googlebot
Disallow: /cgi-bin/

In this case, there was a “User-agent: *” by itself (which matches every search engine agent that abides by robots.txt), and the next directive was a “Disallow: /” (which blocks an entire site). I wasn’t positive how Google would treat that file, so I hopped over to the webmaster console and clicked on the “robots.txt analysis” link. I copied/pasted the robots.txt file into the text box as if I were going to use that robots.txt file on my own site. When I clicked “Check” here’s what Google told me:

[Screenshot: example of a site blocking itself with robots.txt]

Sure enough, that “User-Agent: *” followed by the “Disallow: /” (even with a different user-agent in between) was enough for Googlebot not to crawl the site.

In a way, it makes sense. If you removed some whitespace in the robots.txt file, it could also look like

User-agent: *
User-agent: Wget
Disallow: /

and it’s pretty understandable that our crawler would interpret that conservatively.
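
Here is a minimal Python sketch of the two ways a parser could read that file. It is an illustration only, not Googlebot’s actual code: the names parse_groups, is_blocked, and blank_line_ends_record are hypothetical, and the matching is a simple prefix check. With the flag off, blank lines are ignored and consecutive User-agent lines share the Disallow rules that follow (the conservative reading). With the flag on, a blank line ends a record, which is how the original robots.txt convention describes records.

```python
ROBOTS = """# robots.txt for http://www.example.com

User-agent: *

User-agent: Wget
Disallow: /
"""


def parse_groups(text, blank_line_ends_record=False):
    """Return a list of (agents, disallow_rules) groups from robots.txt text."""
    groups = []
    agents, rules = [], []
    seen_rule = False  # True once the current group has a Disallow line

    def flush():
        # Close the current group; a group without rules blocks nothing,
        # so it is simply dropped.
        nonlocal agents, rules, seen_rule
        if agents and rules:
            groups.append((agents, rules))
        agents, rules, seen_rule = [], [], False

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            if blank_line_ends_record:
                flush()
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if seen_rule:
                flush()  # a User-agent after rules starts a new group
            agents.append(value.lower())
        elif field == "disallow" and agents:
            rules.append(value)
            seen_rule = True
    flush()
    return groups


def is_blocked(groups, agent, path):
    """Prefix-match path against the groups that apply to agent."""
    agent = agent.lower()
    specific = [g for g in groups if any(a != "*" and a in agent for a in g[0])]
    chosen = specific or [g for g in groups if "*" in g[0]]
    # An empty Disallow value means "allow everything", so skip it.
    return any(rule and path.startswith(rule) for _, rules in chosen for rule in rules)


lenient = parse_groups(ROBOTS)
strict = parse_groups(ROBOTS, blank_line_ends_record=True)
print(is_blocked(lenient, "Googlebot", "/index.html"))
print(is_blocked(strict, "Googlebot", "/index.html"))
```

Under the lenient reading, Googlebot falls into the “*” group and is blocked from the whole site; under the strict reading, only Wget is blocked.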

The takeaway is that if you see a page show up as URL-only, with no snippet or “Cached” link, I’d check for problems with your robots.txt file first. The Google webmaster console also reports crawl errors, which can be another way to self-diagnose crawl issues.

P.S. I promised Vanessa that I’d mention that the robots.txt tool doesn’t support the autodiscovery aspect of sitemaps yet, but it will soon. :) I’ll talk about autodiscovery and sitemaps at some point, but personally I think it’s a great development for site owners, because it makes it easier to tell many search engines about your site’s urls.

40 Comments »

  1. Vincent Said,

    April 16, 2007 @ 10:53 am

    Hi Matt
    I have been using this feature for a little while now and find it is aiding us well in the development of a portal which is still in its beta stage and producing thousands of pages where we utilize the robots.txt file to great effect.

    The webmaster console has aided us in identifying issues we have in the build and the way some pages are being rendered internally in the site. It’s a nice piece of kit and we have found it useful.

    What does concern us is the following: we are seeing important internal pages suddenly disappear from the index, and we get a report that the page cannot be found. We tested the page and by all means it was sound as a bell; nothing in the page or in its structure was incorrect. It would have been useful if there was an explanation of why this happened. It had us scratching our heads in frustration trying to solve this mystery of a page that had been indexed well for a while and now suddenly disappears.

    Since then it has re-appeared in the index but still has us thoroughly at a loss as to why it happened.
    Obviously, knowing this would help us minimize the chance of it happening in the future, especially as we are soon to go live with the finished design in three weeks. We would like to close all errors before then, and find the console good for this; I just think a short descriptive explanation would be more appropriate than a generalization that does not take everything into account.

  2. feedthebot.com Said,

    April 16, 2007 @ 12:14 pm

    thanks Matt, great descriptive post, definitely feedthebot worthy :)

  3. Tonnie Said,

    April 16, 2007 @ 12:16 pm

    Matt,

    Thanks again. :)

  4. BobR Said,

    April 16, 2007 @ 12:25 pm

    [quote]The result didn’t show a description, and the “Cached” link was missing too. Often when I see that happen, it’s because the page wasn’t crawled[/quote]

    I’m seeing this now with my site. When I click on the page and check the cached it is crawled and in the cache.

    Also I can reload the Google page and watch my site show sometimes and gone others.

    I just checked and my robots.txt is fine.

    What else would cause this problem?

    Thanks,

  5. Alan Perkins Said,

    April 16, 2007 @ 12:31 pm

    That robots.txt file is invalid, but Googlebot’s “conservative” interpretation is questionable. A blank line is supposed to be a record separator, so another (as it turns out, correct in this case) interpretation would be that the file contains two records - one of which has a missing disallow field, and the other of which is correctly formed.

    Whichever the interpretation, surely the Webmaster console should mention the fact that the file is invalid?

  6. Jean-Luc Said,

    April 16, 2007 @ 12:46 pm

    I agree with Alan Perkins. This robots.txt is not valid and imho Google’s interpretation is not in accordance with the robots.txt standard.

    Good that the Webmaster Console is there to clarify Google’s interpretations, anyway.

    Jean-Luc

  7. Douglas Karr Said,

    April 16, 2007 @ 1:03 pm

    It does, however, throw an error on the robots file location. I read that Google said that it was going to adopt this standard from sitemaps.org but the Robots.txt analysis tool doesn’t appear to have been updated.

    Currently, it states “Syntax not understood” when you put in:
    Sitemap: http://mysite.com/sitemap.xml

    Will that hurt anything?

  8. feedthebot.com Said,

    April 16, 2007 @ 1:11 pm

    What I remember being impressed with about this tool when I first used it was the ability it gave to play around and experiment, which is great for someone seeking to see in real time how a robots.txt file affects a crawler. I really liked that you could put a page in there and see if that page is blocked or not when entering different versions of a robots.txt file.

    I recommend anyone who sorta, kinda knows what a robots.txt file is but isn’t sure to use this tool and set up different scenarios just to see what happens. This tool was of great use to me in learning about what robots.txt files could and could not do.

  9. teddie Said,

    April 16, 2007 @ 4:35 pm

    Matt, the way Google supports robots.txt allows another website to insert text into a result title for blocked pages. Isn’t this a pretty serious flaw with potential legal ramifications? Check out the post on my blog about it.

  10. Dr. David Klein Said,

    April 16, 2007 @ 8:02 pm

    I hope you don’t mind me leaving this question here. First off I love the webtools and check it at least a few times per week. When I go to the diagnostics section, there are always a few web crawl errors under -not found-.

    Then when I click on the details it will list a few addresses with 404 - not found errors.

    The addresses are usually some weird variation that are similar to addresses I do have, such as an extra directory level added in, or a caps being inconsistent with the actual path.

    At first I would aggressively create 301 redirects for each of these paths, but I have come to realize that I don’t know of any actual links that use these paths. The errors seem to disappear over time.

    Can you give any insight as to where these errors come from?

    Can you also confirm that I can just ignore them, and eventually the bot will realize they are 404’s and not look for them?

    Can you also confirm that there is no downgrading of a site if the bot looks for pages that were never there in the first place?

    By the way, since when has 5 + 8 = 13. I could have sworn it was 11! :)

    Thanks!
    dk

  11. TheSEOGuru Said,

    April 16, 2007 @ 8:45 pm

    HI Matt,

    Thanks for all this. One more thing I want to know: you guys use Allow in robots.txt. What is the advantage of Allow? I just checked your robots.txt file and found an error here:

    Line 2 Allow:
    Unknown command. Acceptable commands are “User-agent” and “Disallow”.
    A robots.txt file doesn’t say what files/directories you can allow but just what you can disallow. Please refer to Robots Exclusion Standard page for more informations.

    Reference url: http://tool.motoricerca.info/robots-checker.phtml
    According to the Robots Exclusion Standard this is a syntax error. Does it mean Googlebot doesn’t follow the robots standard? Is there any advantage to using Allow in robots.txt?

    Cheers
    TheSEOGuru

  12. Deb Said,

    April 16, 2007 @ 11:32 pm

    Matt
    thanks for the information. Actually, after I create a robots.txt file I check it on a robots checker site, and when it says OK I am finished.
    But is it true that if we upload a robots.txt file to the server then there is no need to add a robots meta tag to any pages?
    the code is >>

    thanks
    Deb

  13. Chris Hunt Said,

    April 17, 2007 @ 1:41 am

    What I find interesting about this is that Google still lists pages that it considers disallowed in robots.txt. If I use robots.txt to keep Googlebot off certain pages, it’s because I don’t want them *indexed*, not just not crawled.

    Is there a way to tell Google not to index a particular page or pages? I thought it was robots.txt, but obviously I’m wrong.

    (Not that I actually DO want to exclude Google from any of my pages - perish the thought - but it would be useful to know how to do it if I had to)

  14. Chris Hunt Said,

    April 17, 2007 @ 1:49 am

    Oh, and Alan Perkins, serving up an invalid robots.txt is like serving up invalid (X)HTML - the recipient may try to guess what it thinks you meant, rather than what you actually said, because what you actually said is nonsense. It may well guess wrong, but it’s only a dumb computer after all. If you want it to do what you want, serve it valid code in the first place.

  15. Cathy Said,

    April 17, 2007 @ 2:12 am

    Hi Matt,

    Interesting observation regarding robots.txt

    I was wondering how would the command be in the robots.txt if I want to prevent a file on a page from being indexed without blocking the whole page.

    Thanks for your answer in advance

    Cathy :)

  16. Alan Perkins Said,

    April 17, 2007 @ 3:21 am

    That’s one of the points I was making Chris.

    The wider point was that the topic in question is a “robots.txt analysis tool”, but it seems that the analysis tool failed to point out that the robots.txt file was invalid and that a conservative guess was being made about the author’s intent. That’s not great feedback from an analysis tool.

  17. Augustin Said,

    April 17, 2007 @ 11:18 am

    Hi Matt!

    Great post! I was wondering how Google reacts to the allow feature. for example in a case where a file within a disallowed directory is Allowed.

    example:

    User-agent: *
    Allow: /store/images/
    Allow: /store/es/
    Disallow: /store/

    the Google tool says that the “/store/es/” and “/store/images/” is disallowed by line 4 (Disallow: /store/), however luckily they were actually indexed. I have changed it now, to be more precise, but my question is the following:

    Is this a glitch in the tool or is it a glitch in the Google bot not taking the disallow feature into account? Or am I just not understanding things properly ;)

    Thanks in advance Matt!

    Augustin

  18. Matthew Bredel Said,

    April 17, 2007 @ 8:14 pm

    I have to admit that I have been afraid of using robots.txt for just this reason: the fear that a bot will visit my site (and my robots.txt file), read it wrong, and not index my pages. I am a big fan of the Google Sitemaps tool (it provides tons of great information), and I understand how using it to debug the robots.txt file can solve this issue, but I am still too wary to actually create the file. Is sticking with nofollows enough to preserve your spidering? cheers...matt

  19. Alan Perkins Said,

    April 18, 2007 @ 1:40 am

    Are you kidding Keniki? I wasn’t going to mention this, but since you brought it up...

    Maybe “arrogant” is too harsh a word, but it’s the word that comes to my mind when I see a standard being re-written in this way. A backwards-compatible way of implementing auto-discovery would have been to use something like:

    #! Sitemap: http://www.mysite.com/sitemap.xml

    Instead, search engines have agreed on a “standard” that actually breaks the *standard*.

  20. Jean-Luc Said,

    April 18, 2007 @ 7:38 am

    Augustin,

    As you know, the “Allow:” directive is not part of the robots.txt standard. There is a description of Google’s implementation at http://www.google.fr/support/webmasters/bin/answer.py?answer=33575&topic=8460 .

    It says that Googlebot applies the “longest applicable rule”. According to your comment, the Google Webmaster Tool follows another interpretation. Glitch in the tool ?

    Jean-Luc

  21. Augustin Said,

    April 18, 2007 @ 8:48 am

    Merci pour ta réponse Jean-Luc!

    Matt, are we on to something here??? Does the Google Robots.txt tool judge robots.txt’s the exact same way as the actual Googlebot?

  22. Scott Said,

    April 18, 2007 @ 11:21 am

    Excellent post Matt.

  23. reah Said,

    April 19, 2007 @ 1:12 am

    a recent client called that his SERP result won’t show his meta description...i checked the meta and its okay..but when i checked the robots.txt...it contains a / in the Disallow...tsktsk...so i removed the / and hope that he’ll be crawled this time...

  24. frank Said,

    April 20, 2007 @ 1:24 am

    Dear Matt

    On my website I have no robots.txt file, is it better to have this file for the SE.
    Really I don’t know....

    thanks in advance

    regards

    Frank

  25. g1smd Said,

    April 20, 2007 @ 3:55 am

    You only need robots.txt if you have files that you wish to block from being spidered. Blocking them from being spidered does not prevent the URLs of those pages appearing in the SERPs as URL-only entries though.

  26. Tonnie Said,

    April 20, 2007 @ 4:03 am

    I would advise to have a robots.txt, even if it is just a blank one and merely there to prevent receiving a 404

    I even believe Matt did talk about that earlier, but cant find a posting.... Matt?

  27. TerryG Said,

    April 21, 2007 @ 4:32 am

    I sooo needed this. I placed my robots file on the page and the result was not all that good. So I changed it and of course mucked it right up. I needed this post to show me where I went wrong. Thanks.

  28. Frank Said,

    April 22, 2007 @ 6:07 am

    Thanks G1smd & Tonnie

    Clear now, I will add a blank TXT to my website dir.

    regards

    Frank

  29. Paul Said,

    April 23, 2007 @ 6:17 pm

    Matt,

    Long time lurker of your blog. Thank you for the Robots analysis tool; we have been using it on our sites and it has definitely helped, in my honest opinion.

    Thanks again!

  30. Gymso Said,

    April 24, 2007 @ 1:34 am

    can any one suggest my robots text is wrong or not.....

    http://www.gymso.com/robots.txt

  31. Elixirwebsolutions Said,

    April 24, 2007 @ 4:39 am

    Thanks matt,
    I have submitted my site to reinclusion..
    regards,
    Elixir Web Solutions

  32. Mihai Said,

    April 26, 2007 @ 2:25 am

    Good post. I’ve been studying the “Help” that Google provides regarding Webmaster tools, and I think everything you need is there. So, for all users: before doing optimization on a website, consult “Help”; it’s all in there. Of course the tools provided by Google work very well. Finally, be careful what and how you are writing; mistakes can occur when you are in a hurry.

  33. AG Said,

    April 27, 2007 @ 3:04 pm

    Any particular reason that you’ve disallowed ‘wget’ agent in your robots.txt file?

  34. linux host india Said,

    May 28, 2007 @ 3:49 am

    hi,
    i was recently working on some robots.txt and found few people are linking their sitemap in that. is it correct to use such a tag.

    Sitemap: http://www.domainnamehere.com/sitemap.xml

    thanks,
    mohit

  35. The Dog Clothing Company Said,

    July 11, 2007 @ 9:10 am

    I am new to the webmaster console and so far I think it is a great tool for managing the site, particularly this robots.txt analysis tool. Thanks Matt for this useful post!

  36. reliable web hosting Said,

    August 7, 2007 @ 8:26 am

    Hmm, interesting. I never thought that google would refuse to crawl a site because of an invalid robots.txt file, by common sense it would just be better to ignore the file.

  37. bkkdreamer Said,

    August 11, 2007 @ 8:06 am

    I am on Blogger, and my robots.txt file is incorrect. It is excluding search pages, a problem many other bloggers have noticed since the robots.txt file was introduced.

    I know what changes to make to the text, but not how to upload them to the root directory. I tried inserting the text in the template, but i doubt that will work.

    I am not sure Blogger allows us to upload HTML to blogger templates. Anyone have any advice?

  38. Andy Said,

    February 26, 2008 @ 6:52 am

    Is anyone else having problems with the Google webmaster tools robots.txt analyzer?

    I’ve noticed over a few weeks that it’s not working properly.
    It displays the correct robots.txt and you can check URL’s against that.
    However if you make any edits in the tool, running a check appears to use the cached version of the file rather than the locally edited version. Kind of defeats the purpose of the tool really.

    I hope Google fixes this soon, because when it works, it’s a godsend!

  39. Michael Heraghty Said,

    February 28, 2008 @ 2:15 am

    I am having a similar problem to Andy. Worse, Google’s cached version of robots.txt disallows access entirely to one of my sites (I recently implemented a new Wordpress design redux but forgot to change WP’s privacy settings to allow search engines to access the site).

    In the two weeks since the redesign, I had been seeing URLs only with no snippets in the SERPs.

    Last night, I finally figured out what the problem was. Not taking any chances, I updated the Wordpress settings AND manually uploaded a robots.txt file (the Wordpress one is somehow invisible!).

    This morning, still no change. In webmaster tools, Google is still showing the old robots.txt. How long do I have to wait?

  40. Richard Said,

    June 30, 2008 @ 4:42 pm

    I had a very frustrating experience with my robots.txt file that I wanted to share so nobody makes the same mistake. Practicing good SEO, I wanted to create a robots.txt file for improved results and to prevent the bots from searching unnecessary pages.

    After I created my robots.txt, using the one on wordpress as a template, I noticed none of my new pages were being indexed anymore. They used to be indexed immediately and then NOTHING.

    I had used these recommended lines:

    Disallow: /*?*
    Disallow: /*?

    I did this particularly because a survey form I use creates multiple pages with the ? that don’t need to be indexed.

    However, for some reason (I don’t know why), the googlebot sees many of my pages with this type of permalink: http://www.thisishowyoudoit.com/blog/?p=57

    However, my permalink structure looks like this: http://www.thisishowyoudoit.com/blog/10-reasons-why-not-to-host-your-wordpress-blog-on-a-windowsiis-platform/

    Thus, after implementing my robots.txt, many of my pages were no longer indexed.

    Just wanted to warn people about these particular lines in the robots.txt file.

    Thanks,
    Richard
