noindex test

Pay no attention to this page with a noindex tag. I just want to check on how Yahoo/Live/Ask treat pages with noindex meta tags.

Update: Just for the curious, here’s how other search engines treated the noindex meta tag in 2006. Also note that this page is fine for search engines to index; it’s the destination page linked to above with the noindex meta tag.
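
For readers who have not seen one, the sketch below shows roughly what a noindex meta tag looks like and how a crawler might detect it once it has fetched a page. It is only an illustration using Python's standard `html.parser`. The sample page, the class name `RobotsMetaParser`, and the helper `has_noindex` are assumptions made up for this example, not the actual test page or any search engine's code.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. a bare attribute), so normalize.
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for part in attr_map.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def has_noindex(html):
    """Return True if the HTML carries a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical page, invented for this example.
SAMPLE_PAGE = """<html><head>
<title>hypothetical test page</title>
<meta name="robots" content="noindex">
</head><body>test</body></html>"""

print(has_noindex(SAMPLE_PAGE))
```

Note that a check like this can only run after the page has been downloaded, which is the detail the comments below end up turning on.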

32 Responses to noindex test

  1. Huge letdown! I see a new Matt post in my reader, but it’s just a test. 🙁

  2. Dave (original)

    Matt, in relation to http://www.mattcutts.com/blog/quick-comment-for-pixelrn/

    IF Google knows the page/site has been hacked, why do you punish the victim and NOT the perpetrator?

  3. From here it looks like you only have no DMOZ description tag

  4. Dr. Otto Van DerWahl

    Gathering some intel, aye??

  5. This is not the post you are looking for *waves hand*

  6. This page is meant to have a noindex tag… right?
    Also Live reckons you need a meta tag like: , to not be indexed, which is different of course to what google/yahoo recommend.

  7. Dave (original), the victim’s site is the attack vector on innocent users. We want to reduce innocent users’ exposure to malware and hacked sites. We typically wouldn’t know who hacked the site, so we can’t really go after the perpetrator.

  8. Joshua Richardson, it’s the linked-to page that has a noindex tag. You might need to escape the Live tag or write it differently, too.

  9. http://www.mattcutts.com/robots.txt

    User-agent: *
    Disallow: /files/

    Have I missed something? I’m not seeing how the robots would even get to the page to handle the NOINDEX tag.

  10. Ryan Blakemore:
    User-agent: *
    Disallow: /files/
    =
    Search Engine can remember URL but shouldn’t index content.

    Noindex in meta tag or in robots.txt or X-Robots-Tag header
    = No index content and URL, I hope :).

  11. Matt – I can’t help but point out the irony that Google has indexed your test page with the meta noindex. Any explanations for this? Here’s a link for the SERP with the page indexed: http://bit.ly/InKXz

  12. Jim Gianoglio, that was part of my point. If you look at my robots.txt, you’ll see

    User-agent: *
    Disallow: /files/

    So Googlebot can’t access the file to see the noindex meta tag. Therefore (not knowing that the actual page has a noindex meta tag), we show the uncrawled url reference. Subtle, isn’t it?

  13. I’m really missing something here. How is this test going to show how the engines handle the meta noindex? If you found out how the meta tag was handled from this test, wouldn’t the results be more revealing of how the engines handle robots.txt?

  14. I think that tests and their results are important to communicate to webmasters. I believe that a large number of webmasters don’t understand the differences presented by no-index versus no-follow links versus robots.txt exclusions.

    @Ryan: I think what’s happening here is that Google is including a blank page reference because it can’t actually crawl the page (no-index), but there are still incoming links with value to it.

  15. Since you’re in the mood for testing, what about using noindex in your robots.txt as Google allows to avoid uncrawled url references (or at least so webmaster help says)?

  16. Dave (Original)

    We typically wouldn’t know who hacked the site, so we can’t really go after the perpetrator.

    Ok, but it still begs the question of why the *victim is punished by Google* when Google KNOWS they are the victim?

    It’s like the Police locking up a victim of crime.

  17. Dave (Original)

    Matt, just to add to my question above. I can see why Google would drop the site from the SERPs, IF other users can be infected with malware, but why punish the victim when there is NO malware and no risk to any other users? E.g., links to bad neighborhoods etc.

  18. Yeah Matt. Dave has a point. Plus, Google knows all. They are 1984.

  19. If Google is not able to access the page due to the robots.txt file blocking it, and the meta is set to noindex, how can Google, in good faith, recommend similar pages?
    http://www.google.com/search?hl=en&q=related:www.mattcutts.com/files/noindex-test.html

  20. Thanks a lot Matt. I’ll try this. But from here it looks like you only have no DMOZ description tag.

  21. Dave (Original)

    Because the “similar pages” aren’t blocked and Google still spiders NoIndex.

  22. Dave, but the noindex shouldn’t be spidered since it’s blocked by robots.txt.

  23. So Matt, how are the others treating the noindex? G, Y and M haven’t indexed it.

    what else can you tell about it?

  24. huh? No spam protection anymore?

  25. Dave (Original)

    Dave, but the noindex shouldn’t be spidered since it’s blocked by robots.txt.

    I completely agree and have mentioned this to Matt about robots.txt and noindex in his post ASKING for opinions on the subject.

    Unfortunately Google is a GIANT that often gets its own way, even when it’s morally wrong.

  26. lol, page is indexed, so noindex won’t work 😀

  27. Page isn’t spidered because it’s blocked, but it is added to the index (with a best guess at page title) due to links. Take delicious.com as an example. Same thing is happening there: meta noindex on the home page, but all files blocked by robots. So you do a search on Google for “delicious” and the new home page shows up, but with no snippet or meta description because the page wasn’t crawled. There is a page title associated with it, not because it actually is the page title, but because it’s the domain name and it is used in anchor text (Google will attribute a page title to any page that doesn’t have one, and it’s often the name of the root).

    If you do search for “delicious” on MSN, the page doesn’t come up, so it appears that MSN is accessing the file (despite what robots.txt says) and finds the noindex meta.

    Yahoo! gives page title and a description, which actually isn’t in the meta description or found anywhere on the page. It’s pulled from the Yahoo! directory listing of the site. Yahoo! tends to assign Y directory data to pages that don’t have it (and even often times when they do). So it would appear that Y also follows the robots.txt directive. Of course delicious.com is a Yahoo! property, so you could draw a different conclusion, but this is my thought on the subject.

  28. So Matt, I guess the question is “How can Google display similar pages to a page that has been blocked by robots.txt?”

  29. Dave (Original)

    I just want to check on how Yahoo/Live/Ask treat pages with noindex meta tags.

    Yahoo at least does the same as Google, they ignore the tag, flex their muscle and Index it. I guess might is right?

  30. Matt, please see if your associates would enhance the URL removal tool to remove https pages or permit http or https to be used as the preferred domain.

    A client has a site, running on the worst CMS I have come across. No root access is provided and it is hosted on a Windows server. 🙁 So I have no ability to add meta tags, can’t serve separate robots.txt files, and can’t even use 301’s. What a nightmare and the company providing this horrible service is not compelled to help.

    If we had the ability to remove just the https pages, or set the preferred domain to http or https, this could quickly solve some problems and remove duplicate content from the index. As it stands now, the homepage is indexed with https, http, http://www, and previously had http://www.domain/default.asp. The client wonders why their homepage no longer has its first page listings in Google Search…

    Have a good day Matt. 🙂

  31. Couldn’t help but pay attention to the interesting results, seeing it indexed by Google, Ask, Live and Yahoo. It means that, as a best practice, a webmaster must double-check that the page is not blocked by robots.txt if going to use no follow meta. But just using robots.txt still seems to be the best method.

  32. Hi Matt,

    As Bill mentioned
    >> Matt, please see if your associates would enhance the URL removal tool to remove https pages or permit http or https to be used as the preferred domain.

    Can we have some preferred option for http and https under webmaster console or even as a new parameter to robots.txt? It is sometimes so difficult to have two robots.txt files: seomoz.org/ugc/solving-duplicate-content-issues-with-http-and-https

    It will help a lot of people, looking for a reply, please do consider answering it, I will also appreciate if I get to know that I am answered 🙂 (Though I will keep following it up under this blog post)

    Big thanks for all your hard work Matt.

    Regards,
    Aji
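
The point made in comment 12 (a crawler that obeys robots.txt never gets to read the noindex meta tag on a blocked page) can be checked with a small sketch using Python's standard `urllib.robotparser`. The robots.txt rules and both URLs are the ones quoted in the thread. The helper name `may_crawl` is an assumption for this example, and Python's parser is only a stand-in for how a real crawler interprets the rules.

```python
from urllib.robotparser import RobotFileParser

# The two rules quoted from http://www.mattcutts.com/robots.txt in the comments.
ROBOTS_TXT = """User-agent: *
Disallow: /files/
"""


def may_crawl(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# URLs taken from the comments above.
TEST_URL = "http://www.mattcutts.com/files/noindex-test.html"
BLOG_URL = "http://www.mattcutts.com/blog/quick-comment-for-pixelrn/"

# The test page sits under /files/, so a compliant crawler may not fetch it
# and therefore cannot see any meta tag inside it.
print(may_crawl(ROBOTS_TXT, "Googlebot", TEST_URL))
print(may_crawl(ROBOTS_TXT, "Googlebot", BLOG_URL))
```

With the rules as quoted, the first check should come back False and the second True. That matches the explanation given in the thread: the blocked page can still be listed as an uncrawled URL reference because of links pointing at it, while its noindex tag goes unread.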
