Robots.txt analysis tool

This is just a reminder that if you see a problem with your site, one of the first places you may want to look is our webmaster console. In some cases, Google can alert site owners in the webmaster console when we see an issue such as hidden text. In a case that I just saw yesterday, the robots.txt analysis tool in the webmaster console was a huge help in solving a problem. Here’s an example of debugging a robots.txt issue.

Someone was asking about a particular result in our search results. The result didn’t show a description, and the “Cached” link was missing too. Often when I see that happen, it’s because the page wasn’t crawled. When I see that, the first thing I check out is the robots.txt file. Loading that in the browser showed me a file that looked like this:

# robots.txt for http://www.example.com

User-agent: *

User-agent: Wget
Disallow: /

At first glance, the robots.txt file looked okay, but I did notice one strange thing. Normally robots.txt files have pairs of “User-Agent:” and “Disallow:” lines, e.g.

User-agent: Googlebot
Disallow: /cgi-bin/

In this case, there was a “User-agent: *” by itself (which matches every search engine agent that abides by robots.txt), and the next directive was a “Disallow: /” (which blocks an entire site). I wasn’t positive how Google would treat that file, so I hopped over to the webmaster console and clicked on the “robots.txt analysis” link. I copied/pasted the robots.txt file into the text box as if I were going to use that robots.txt file on my own site. When I clicked “Check,” here’s what Google told me:

[Image: Example of a site blocking itself with robots.txt]

Sure enough, that “User-Agent: *” followed by the “Disallow: /” (even with a different user-agent in between) was enough for Googlebot not to crawl the site.

In a way, it makes sense. If you removed some whitespace in the robots.txt file, it could also look like

User-agent: *
User-agent: Wget
Disallow: /

and it’s pretty understandable that our crawler would interpret that conservatively.
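
If the intent here was to let every well-behaved crawler in while keeping Wget out (that’s my guess, but only the site owner knows for sure), a less ambiguous way to write the file is to give each record its own Disallow line, with an empty Disallow meaning “allow everything”:

User-agent: *
Disallow:

User-agent: Wget
Disallow: /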

The takeaway is that if you see a page show up as URL-only with no snippet or cached page link, I’d check for problems with your robots.txt file first. The Google webmaster console also reports crawl errors, which can be another way to self-diagnose crawl issues.
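
If you’d rather do a quick check from a script instead of the browser, here’s a rough sketch using Python’s standard-library robots.txt parser (the URL and user-agent below are just placeholders). Keep in mind that this parser may not interpret an ambiguous file exactly the way Googlebot does, which is exactly why the tool in the webmaster console is handy:

import urllib.robotparser

# Fetch and parse the site's live robots.txt (placeholder URL)
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# Ask whether a particular user-agent may fetch a particular URL
print(rp.can_fetch("Googlebot", "http://www.example.com/somepage.html"))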

P.S. I promised Vanessa that I’d mention that the robots.txt tool doesn’t support the autodiscovery aspect of sitemaps yet, but it will soon. 🙂 I’ll talk about autodiscovery and sitemaps at some point, but personally I think it’s a great development for site owners, because it makes it easier to tell many search engines about your site’s URLs.
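
For the curious, autodiscovery simply means adding a line like this to your robots.txt so that any search engine fetching the file can find your sitemap (the domain here is just a placeholder):

Sitemap: http://www.example.com/sitemap.xml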

44 Responses to Robots.txt analysis tool

  1. Hi Matt
    I have been using this feature for a little while now and find it is aiding us well in the development of a portal which is still in its beta stage and producing thousands of pages, where we utilize robots.txt to great effect.

    The webmaster console has aided us in identifying issues we have in the build and the way some pages are being rendered internally in the site. It’s a nice piece of kit and we have found it useful.

    What does concern us is the following: we are seeing important internal pages suddenly disappear from the index, and we get a report that the page cannot be found. We tested the page and by all means it was sound as a bell; nothing in the page or in its structure was incorrect. It would have been useful if there were an explanation of why this happened. It had us scratching our heads in frustration trying to solve the mystery of a page that had been indexed well for a while and then suddenly disappeared.

    Since then it has re-appeared in the index, but we are still thoroughly at a loss as to why it happened.
    Obviously, knowing this info would help us minimize the occurrence of this happening in the future, especially as we are soon to go live with the finished design in three weeks and would like to close all errors before then. We find the console good for this; I just think a short descriptive explanation would be more appropriate than a generalization that does not take everything into account.

  2. thanks Matt, great descriptive post, definitely feedthebot worthy 🙂

  3. Matt,

    Thanks again. 🙂

  4. [quote]The result didn’t show a description, and the “Cached” link was missing too. Often when I see that happen, it’s because the page wasn’t crawled[/quote]

    I’m seeing this now with my site. When I click on the page and check the cached version, it is crawled and in the cache.

    Also, I can reload the Google results page and watch my site show up sometimes and be gone at other times.

    I just checked and my robots.txt is fine.

    What else would cause this problem?

    Thanks,

  5. That robots.txt file is invalid, but Googlebot’s “conservative” interpretation is questionable. A blank line is supposed to be a record separator, so another (as it turns out, correct in this case) interpretation would be that the file contains two records – one of which has a missing disallow field, and the other of which is correctly formed.

    Whichever the interpretation, surely the Webmaster console should mention the fact that the file is invalid?

  6. I agree with Alan Perkins. This robots.txt is not valid and imho Google’s interpretation is not in accordance with the robots.txt standard.

    Good that the Webmaster Console is there to clarify Google’s interpretations, anyway.

    Jean-Luc

  7. It does, however, throw an error on the robots file location. I read that Google said that it was going to adopt this standard from sitemaps.org but the Robots.txt analysis tool doesn’t appear to have been updated.

    Currently, it states “Syntax not understood” when you put in:
    Sitemap: http://mysite.com/sitemap.xml

    Will that hurt anything?

    What I remember being impressed with about this tool when I first used it was the ability it gave to play around and experiment, for someone seeking to see in real time how a robots.txt file affects a crawler. I really liked that you could put a page in there and see whether that page is blocked or not when entering different versions of a robots.txt file.

    I recommend anyone who sorta, kinda knows what a robots.txt file is but isn’t sure to use this tool and set up different scenarios just to see what happens. This tool was of great use to me in learning about what robots.txt files could and could not do.

    Matt, the way Google supports robots.txt allows another website to insert text into a result title for blocked pages. Isn’t this a pretty serious flaw with potential legal ramifications? Check out the post on my blog about it.

  10. I hope you don’t mind me leaving this question here. First off I love the webtools and check it at least a few times per week. When I go to the diagnostics section, there are always a few web crawl errors under -not found-.

    Then when I click on the details it will list a few addresses with 404 – not found errors.

    The addresses are usually some weird variation of addresses I do have, such as an extra directory level added in, or capitalization that is inconsistent with the actual path.

    At first I would aggressively create 301 redirects for each of these paths, but I have come to realize that I don’t know of any actual links that use these paths. The errors seem to disappear over time.

    Can you give any insight as to where these errors come from?

    Can you also confirm that I can just ignore them, and eventually the bot will realize they are 404’s and not look for them?

    Can you also confirm that there is no downgrading of a site if the bot looks for pages that were never there in the first place?

    By the way, since when has 5 + 8 = 13. I could have sworn it was 11! 🙂

    Thanks!
    dk

  11. Hi Matt,

    Thanks for all this. One more thing I want to know: you guys use Allow in robots.txt. What is the advantage of Allow? I just checked your robots.txt file and found an error here:

    Line 2 Allow:
    Unknown command. Acceptable commands are “User-agent” and “Disallow”.
    A robots.txt file doesn’t say what files/directories you can allow but just what you can disallow. Please refer to Robots Exclusion Standard page for more informations.

    Reference url: http://tool.motoricerca.info/robots-checker.phtml
    According to the Robots Exclusion Standard this is a syntax error. Does that mean Googlebot doesn’t follow the robots standard? Is there any advantage of using Allow in robots.txt?

    Cheers
    TheSEOGuru

  12. Matt
    thanks for the information. Actually, after creating a robots.txt file I check it with a robots checker site; when it says OK, I’m finished.
    But is it true that if we upload a robots.txt file to the server, then there is no need to add a robots meta tag to any pages?
    the code is >>

    thanks
    Deb

  13. What I find interesting about this is that Google still lists pages that it considers disallowed in robots.txt. If I use robots.txt to keep Googlebot off certain pages, it’s because I don’t want them *indexed*, not just not crawled.

    Is there a way to tell Google not to index a particular page or pages? I thought it was robots.txt, but obviously I’m wrong.

    (Not that I actually DO want to exclude Google from any of my pages – perish the thought – but it would be useful to know how to do it if I had to)

  14. Oh, and Alan Perkins, serving up an invalid robots.txt is like serving up invalid (X)HTML – the recipient may try to guess what it thinks you meant, rather than what you actually said, because what you actually said is nonsense. It may well guess wrong, but it’s only a dumb computer after all. If you want it to do what you want, serve it valid code in the first place.

  15. Hi Matt,

    Interesting observation regarding robots.txt

    I was wondering what the command would be in robots.txt if I want to prevent a file on a page from being indexed without blocking the whole page.

    Thanks for your answer in advance

    Cathy 🙂

  16. That’s one of the points I was making Chris.

    The wider point was that the topic in question is a “robots.txt analysis tool”, but it seems that the analysis tool failed to point out that the robots.txt file was invalid and that a conservative guess was being made about the author’s intent. That’s not great feedback from an analysis tool.

  17. Hi Matt!

    Great post! I was wondering how Google reacts to the Allow feature, for example in a case where a file within a disallowed directory is allowed.

    example:

    User-agent: *
    Allow: /store/images/
    Allow: /store/es/
    Disallow: /store/

    the Google tool says that “/store/es/” and “/store/images/” are disallowed by line 4 (Disallow: /store/); however, luckily they were actually indexed. I have changed it now, to be more precise, but my question is the following:

    Is this a glitch in the tool or is it a glitch in the Google bot not taking the disallow feature into account? Or am I just not understanding things properly 😉

    Thanks in advance Matt!

    Augustin

    I have to admit that I have been afraid of using robots.txt for just this reason: the fear that a bot will visit my site (and my robots.txt file), read it wrong, and not index my pages. I am a big fan of the Google Sitemaps tool (it provides tons of great information) and I understand how using this to debug the robots.txt file can solve the issue, but I am still too wary to actually create the file. Is sticking with nofollows enough to control spidering? cheers…matt

  19. Are you kidding Keniki? I wasn’t going to mention this, but since you brought it up…

    Maybe “arrogant” is too harsh a word, but it’s the word that comes to my mind when I see a standard being re-written in this way. A backwards-compatible way of implementing auto-discovery would have been to use something like:

    #! Sitemap: http://www.mysite.com/sitemap.xml

    Instead, search engines have agreed on a “standard” that actually breaks the *standard*.

  20. Augustin,

    As you know, the “Allow:” directive is not part of the robots.txt standard. There is a description of Google’s implementation at http://www.google.fr/support/webmasters/bin/answer.py?answer=33575&topic=8460 .

    It says that Googlebot applies the “longest applicable rule”. According to your comment, the Google Webmaster Tool follows another interpretation. Glitch in the tool?
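
    To illustrate, if I read that page correctly: for a URL like /store/es/index.html (just an example path), both “Allow: /store/es/” and “Disallow: /store/” apply, and the Allow pattern is the longer one, so Googlebot should be allowed to crawl it.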

    Jean-Luc

  21. Thanks for your answer, Jean-Luc!

    Matt, are we on to something here??? Does the Google robots.txt tool judge robots.txt files exactly the same way as the actual Googlebot does?

  22. Excellent post Matt.

  23. A recent client called because his SERP result wouldn’t show his meta description. I checked the meta tag and it’s okay, but when I checked the robots.txt, it contained a / in the Disallow. Tsk tsk. So I removed the / and hope that he’ll be crawled this time.

  24. Dear Matt

    On my website I have no robots.txt file; is it better to have this file for the search engines?
    Really, I don’t know…

    thanks in advance

    regards

    Frank

  25. You only need robots.txt if you have files that you wish to block from being spidered. Blocking them from being spidered does not prevent the URLs of those pages appearing in the SERPs as URL-only entries though.

    I would advise having a robots.txt, even if it is just a blank one that is merely there to prevent receiving a 404.

    I even believe Matt talked about that earlier, but I can’t find the posting… Matt?

  27. I sooo needed this. I placed my robots file on the page and the result was not all that good. So I changed it and of course mucked it right up. I needed this post to show me where I went wrong. Thanks.

  28. Thanks G1smd & Tonnie

    Clear now. I will add a blank robots.txt to my website’s directory.

    regards

    Frank

  29. Matt,

    Long time lurker of your blog. Thank you for the robots.txt analysis tool; we have been using it on our sites and it has definitely helped, in my honest opinion.

    Thanks again!

  30. Can anyone tell me whether my robots.txt is wrong or not?

    http://www.gymso.com/robots.txt

  31. Thanks Matt,
    I have submitted my site for reinclusion.
    regards,
    Elixir Web Solutions

  32. Good post. I’ve been studying the “Help” that Google provides regarding Webmaster Tools, and I think that everything you need is there. So, for all users: before doing optimization on a website, consult the Help; it’s all in there. Of course the tools provided by Google work very well. Finally, be careful about what and how you are writing; mistakes can occur when you are in a hurry.

  33. Any particular reason that you’ve disallowed the ‘Wget’ agent in your robots.txt file?

  34. Hi,
    I was recently working on some robots.txt files and found a few people linking their sitemap in them. Is it correct to use such a line?

    Sitemap: http://www.domainnamehere.com/sitemap.xml

    thanks,
    mohit

  35. I am new to the webmaster console and so far I think it is a great tool for managing the site, particularly this robots.txt analysis tool. Thanks Matt for this useful post!

  36. Hmm, interesting. I never thought that Google would refuse to crawl a site because of an invalid robots.txt file; by common sense it would be better to just ignore the file.

  37. I am on Blogger, and my robots.txt file is incorrect. It is excluding search pages – a problem many other bloggers have noticed since the robots.txt file was introduced.

    I know what changes to make to the text, but not how to upload them to the root directory. I tried inserting the text in the template, but I doubt that will work.

    I am not sure Blogger allows us to upload HTML to blogger templates. Anyone have any advice?

  38. Is anyone else having problems with the Google webmaster tools robots.txt analyzer?

    I’ve noticed over a few weeks that it’s not working properly.
    It displays the correct robots.txt and you can check URLs against it.
    However if you make any edits in the tool, running a check appears to use the cached version of the file rather than the locally edited version. Kind of defeats the purpose of the tool really.

    I hope Google fixes this soon, because when it works, it’s a godsend!

  39. I am having a similar problem to AndySaid. Worse, Google’s cached version of robots.txt disallows access entirely to one of my sites (I recently implemented a new WordPress design redux but forgot to change WP’s privacy settings to allow search engines to access the site).

    In the two weeks since the redesign, I had been seeing URLs only with no snippets in the SERPs.

    Last night, I finally figured out what the problem was. Not taking any chances, I updated the WordPress settings AND manually uploaded a robots.txt file (the WordPress one is somehow invisible!).

    This morning, still no change. In webmaster tools, Google is still showing the old robots.txt. How long do I have to wait?

  40. I had a very frustrating experience with my robots.txt file that I wanted to share so nobody makes the same mistake. Practicing good SEO, I wanted to create a robots.txt file for improved results and to prevent the bots from searching unnecessary pages.

    After I created my robots.txt, using the one on WordPress as a template, I noticed none of my new pages were being indexed anymore. They used to be indexed immediately and then NOTHING.

    I had used these recommended lines:

    Disallow: /*?*
    Disallow: /*?

    I did this particularly because a survey form I use creates multiple pages with the ? that don’t need to be indexed.

    However, for some reason (I don’t know why), Googlebot sees many of my pages with this type of permalink: http://www.thisishowyoudoit.com/blog/?p=57

    However, my permalink structure looks like this: http://www.thisishowyoudoit.com/blog/10-reasons-why-not-to-host-your-wordpress-blog-on-a-windowsiis-platform/

    Thus, after implementing my robots.txt, many of my pages were no longer indexed.

    Just wanted to warn people about these particular lines in the robots.txt file.
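
    For anyone with a similar setup, a narrower rule scoped to just the pages you actually want to keep out (the /survey/ path below is purely hypothetical) would avoid catching the ?p= style permalinks:

    User-agent: *
    Disallow: /survey/*?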

    Thanks,
    Richard

  41. Hi Matt,

    My webmaster tool was indicating the following error with respect to my sitemap.xml. “URL timeout: robots.txt timeout
    We encountered an error while trying to access your Sitemap. Please ensure your Sitemap follows our guidelines and can be accessed at the location you provided and then resubmit.”

    My sitemap was generated by following the guidelines and it was definitely in the right location.

    Can you please help me to resolve this problem?
    Many thanks

  42. Hi Matt

    Hopefully you can assist me. We are dumbstruck by what has happened with the site we are currently developing. We launched our indexing to Google in July and got a very high ranking on the first page, but a few weeks in we suddenly lost our indexing and placement, though in some search engines we are still on the first page. The other strange thing that has happened is that the page we were shown on is now showing other companies with LINUXOS or Linux OS Solutions in their titles, etc.

    We have looked in our webmaster tools and can see our rankings were high through Google, but now we are struggling to have our description and so on indexed through Google. I have been looking at our robots.txt, but would appreciate it if someone could explain what has happened.

    Many thanks for your time

  43. The webmaster tools page is doing something strange.

    Pages blocked by my robots.txt file are slowly being listed on the “Restricted by robots.txt” page. The number was slowly going up until it got to 23.
    It should have continued to go up as Google continued to index my site, since there are a lot more blocked pages.

    However, now the number is going down. It is at 22.
    So pages that are blocked are disappearing from the list.

    Is this normal? I don’t see how it can be.

    I have checked the robots.txt file and URLs in Google and everything seems to be working properly.

    I also wish Google wouldn’t take so long to index my entire site. It is not that big. Very frustrating.

  44. There are also robots.txt tools that allow you to experiment a little, letting you know if there are any problems with your file prior to putting it online.
