What should NOINDEX do?
Okay, this post will be colossally boring to some people. But I wanted to give you a peek at debates behind the curtain in Google’s search quality group. Here’s a policy discussion about NOINDEX and how Google should treat the NOINDEX meta tag. First, you’ll want to read this post about how Google handles the NOINDEX meta tag. You may also want to watch this video about how to remove your content from Google or prevent it from being indexed in the first place. Here’s the conclusion from my earlier blog post:
So based on a sample size of one page, it looks like search engines handle the “NOINDEX” meta tag:
- Google doesn’t show the page in any way
- Ask doesn’t show the page in any way
- MSN shows a url reference and Cached link, but no snippet. Clicking the cached link doesn’t return anything.
- Yahoo! shows a url reference and Cached link, but no snippet. Clicking on the cached link returns the cached page.
The question is whether Google should completely drop a NOINDEX’ed page from our search results vs. show a reference to the page, or something in between? Let me lay out the arguments for each:
Completely drop a NOINDEX’ed page
This is how we’ve behaved for the last several years, and webmasters are used to it. The NOINDEX meta tag gives a good way, in fact one of the only ways, to completely remove all traces of a site from Google (another way is our url removal tool). That’s incredibly useful for webmasters. The only corner case is that if Google sees a link to a page A but doesn’t actually crawl the page, we won’t know that page A has a NOINDEX tag and we might show the page as an uncrawled url. There’s an interesting remedy for that: currently, Google allows a NOINDEX directive in robots.txt and it will completely remove all matching site urls from Google. (That behavior could change based on this policy discussion, of course, which is why we haven’t talked about it much.)
Webmasters sometimes shoot themselves in the foot by using NOINDEX, but if a site’s traffic from Google is very low, the webmaster will be motivated to diagnose the issue themselves. Plus we could add a NOINDEX check into the webmaster console to help webmasters self-diagnose if they’ve removed their own site with NOINDEX. The NOINDEX meta tag serves a useful role that’s different than robots.txt, and the tag is far enough off the beaten path that few people use the NOINDEX tag by mistake.
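The corner case above comes down to where the directive lives: it sits inside the page itself, so a crawler only learns about it after fetching the HTML. Here is a minimal Python sketch of that check, using only the standard library. The class and function names are made up for illustration, and this is not a description of how Google actually parses pages.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Hypothetical helper: collects directives from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta ... />) are also routed here by HTMLParser.
        if tag != "meta":
            return
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated, case-insensitive list.
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def has_noindex(html):
    """Return True if the fetched HTML carries a robots meta tag with 'noindex'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="NOINDEX, follow"></head></html>'
print(has_noindex(page))
```

Note that `has_noindex` needs the page body as input. If the page is never crawled, there is nothing to feed it, which is exactly why an uncrawled url can still show up.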
Show a link/reference to NOINDEX’ed pages
Our highest duty has to be to our users, not to an individual webmaster. When a user does a navigational query and we don’t return the right link because of a NOINDEX tag, it hurts the user experience (plus it looks like a Google issue). If a webmaster really wants to be out of Google without even a single trace, they can use Google’s url removal tool. The numbers are small, but we definitely see some sites accidentally remove themselves from Google. For example, if a webmaster adds a NOINDEX meta tag to finish a site and then forgets to remove the tag, the site will stay out of Google until the webmaster realizes what the problem is. In addition, we recently saw a spate of high-profile Korean sites not returned in Google because they all have a NOINDEX meta tag. If high-profile sites like
- http://www.police.go.kr/main/index.do (the National Police Agency of Korea)
- http://www.nmc.go.kr/ (the National Medical Center of Korea)
- http://www.yonsei.ac.kr/ (Yonsei University)
aren’t showing up in Google because of the NOINDEX meta tag, that’s bad for users (and thus for Google).
Some middle ground in between
The vast majority of webmasters who use NOINDEX do so deliberately and use the meta tag correctly (e.g. for parked domains that they don’t want to show up in Google). Users are most discouraged when they search for a well-known site and can’t find it. What if Google treated NOINDEX differently if the site was well-known? For example, if the site was in the Open Directory, then show a reference to the page even if the site used the NOINDEX meta tag. Otherwise, don’t show the site at all. The majority of webmasters could remove their site from Google, but Google would still return higher-profile sites when users searched for them.
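To make the three positions easier to compare, here is a small Python sketch that writes each one as a rule. Everything in it is an assumption for illustration: the function name, the policy labels, and the `is_well_known` flag (which stands in for something like an Open Directory listing) are not part of any real system.

```python
def serp_treatment(has_noindex, is_well_known, policy):
    """Return how a crawled page would be shown under a given policy.

    policy is one of the hypothetical labels:
      'drop'      - completely drop a NOINDEX'ed page
      'reference' - show a link/reference to NOINDEX'ed pages
      'middle'    - show a reference only for well-known sites
    Result is 'full' (normal listing), 'reference' (link only), or 'hidden'.
    """
    if not has_noindex:
        return "full"
    if policy == "drop":
        return "hidden"
    if policy == "reference":
        return "reference"
    if policy == "middle":
        return "reference" if is_well_known else "hidden"
    raise ValueError("unknown policy: %r" % (policy,))


# A parked domain and a high-profile site, both carrying NOINDEX.
for label, well_known in (("parked domain", False), ("high-profile site", True)):
    for policy in ("drop", "reference", "middle"):
        print(label, policy, serp_treatment(True, well_known, policy))
```

The sketch makes the open question visible: under the middle ground, everything hinges on how `is_well_known` gets decided.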
What do you think?
That’s the internal discussion that we’ve been having about NOINDEX meta tags. Now I’m curious what you think. Here’s a poll:
How should Google treat the NOINDEX meta tag?
I’d also be interested in (constructive) suggestions in the comments about how Google should treat the NOINDEX meta tag. Try to step into both a regular user’s shoes as well as the position of a site owner before leaving a comment.
Matt Cutts Said,
February 24, 2008 @ 8:42 pm
By the way, if you’re going to be at SMX West this week, don’t be surprised if I ask your opinion on this issue.
EGOL Said,
February 24, 2008 @ 9:38 pm
I can’t understand why there is a debate on this topic?
The webpage is the webmaster’s property. He placed the “noindex” on the page. That is a denial of content. Clearly communicated.
I think that it is disrespectful for a search engine to index what a webmaster asks not to have indexed. Google has “webmaster guidelines” and hopes that they are respected. What makes this any different?
I think it could be compared to a “no trespassing” sign on real estate.
MVelie Said,
February 24, 2008 @ 9:39 pm
I think google should display a link; however, that link should not count towards page rank for the page. From what I remember this was used to help get rid of blog spam. If it does not count towards rank then it still will cut down on rank, but will allow the site to still show up on the index.
SearcH◆ EngineS WEB Said,
February 24, 2008 @ 9:42 pm
How is this policy any different than when Google BANNED an entire domain because of even a small violation of the Webmaster guidelines????
Think about it!!!
There was no concern about the user experience THEN - until SearchEnginesWEB started tenaciously protesting the policy.
_______________________________________________________
However, in reference to suggestions for the NOINDEX:
1- One solution would be to develop a new extension to the NOINDEX tags.
Allow Webmasters to use a proprietary NOINDEX tag that would differentiate between a complete disappearance in Google’s SERPs and a referenced link. This would give Webmasters the choice
Perhaps it could be named NOINDEXDOMAIN
2- Another solution would be to ALLOW the site to be referenced in the index - BUT:
a- Do not LINK it -
b- Put a Strike Through It
c- Make The Link GRAY instead of the Blue
d- Put ‘Expired’ after the domain
Bryce Said,
February 24, 2008 @ 9:49 pm
I have always interpreted NOINDEX to mean just that: I do not want Google to index this page, à la the Disallow rule in a robots.txt.
adam Said,
February 24, 2008 @ 9:57 pm
noindex should mean exactly what it says it is. I agree with EGOL on this. Stick to the rules YOU gave US.
Pratheep Said,
February 24, 2008 @ 10:06 pm
Matt,
Is there any way to block the bots from indexing a part of a page?
Pratheep
EGOL Said,
February 24, 2008 @ 10:19 pm
Imagine a world with 50 really important search engines. I don’t want my content included. I should be able to accomplish that with a simple noindex instruction. I should not be expected to go to each search engine that exists and each one that springs up to keep my content out.
Going back to the “no trespassing” sign comparison… if Google was a taxi service would they drive their clients across posted boundaries - without their knowledge? No they would not.
However, that “no trespassing” sign is visible to the public. And, in the case of real estate anybody who walks past it can see it and there is no law against commenting to a friend…. “That cranky old EGOL has a no trespassin’ sign on his land”.
So, if a search engine user types a domain or a page URL into the query box, it would be OK for google to have a special page that says… “The cranky old fart that owns this page says Keep Out”. However, that page should not contain a clickthrough link. That is like opening the gate to the posted land.
In the case of real estate, any person who isn’t blind can see across the property boundary and notice objects on my land. They can talk about them freely - no laws against it. This is the point where I believe that the technology of a search engine departs from the real estate example. When the spider arrives and requests the “noindexed” page, that noindex instruction means: “Don’t look at this” So that page should not appear in any SERP - even if that SERP is relevant. This is the same as land with a privacy fence. Google should not know what is in there.
Jan Said,
February 24, 2008 @ 10:31 pm
Hi Matt,
Don’t be evil… if a website uses the noindex tag, it means in my opinion that the webmaster doesn’t want the site in the search engine’s index.
If he set the meta tag to noindex by mistake…. bad luck for him.
Google could send him an email like: we’re interested in putting your website in our search index, would you like to be in it? (possible spam mail??)
So don’t force websites to show up in the Google SERPs when the meta tag is set.
Greetings from Germany
Jan
jonah stein Said,
February 24, 2008 @ 10:41 pm
Matt
This one seems pretty simple. If the page has a no index, don’t include it in your index. While it is conceivable that someone accidentally inserts the meta noindex, it takes an overt act to do so, rather than an omission.
I suppose the rule might arguably be slightly different if the entire site has a noindex and someone is doing a navigational search. In that case including a link with no description or title could be appropriate.
Dave (original) Said,
February 24, 2008 @ 10:52 pm
Matt, why raise the subject when Google is only going to do what is the best interest of Google & its shareholders, regardless?
It SHOULD be no-brainer and in compliance with Google’s OWN guidelines, which it expects Webmasters to follow;
There is ONLY ONE RIGHT THING TO DO.
Maciej Głuszek Said,
February 24, 2008 @ 11:38 pm
I agree with EGOL.
Since the webmaster placed a noindex tag on their website, that means they don’t want this website to show up in search engines.
You say that numbers are small..ok, so maybe a good way would be to show some info about blocked (by noindex) websites in the webmasters console.
“Not returning the right link because of a NOINDEX tag, it hurts the user experience” - so don’t show it..
You cannot judge whether a website is accidentally blocked or noindex is placed on purpose.
And going after more and more content regardless of your policies and users agreement won’t get you anywhere.
So, I’d vote for not showing the page at all.
Regards
Harith Said,
February 24, 2008 @ 11:50 pm
Matt,
Tell us about a middle ground. In the case:
Google does not show that page at all, but it still follows the links on that page.
Thoughts?
Michael D Said,
February 24, 2008 @ 11:51 pm
I’m confused as to why a well known site would not be showing up. If it were a case where a site only wanted direct traffic and didn’t want to be indexed I’d expect that request should be honored.
I’m thinking middle ground but likely different from what you’re discussing. I know I’ve mistakenly used noindex tags in the past, I’d assume other new webmasters have as well.
Maybe there needs to be an extra strength noindex.
Craig Said,
February 24, 2008 @ 11:56 pm
While 99% of me wants to agree with the majority on this one and say it should completely stay out of your index I can see where Google are coming from with the accidental blocking.
Because of this I am in 100% agreement (that’s 1% more!) with Search Engines Web: a new tag should be developed that IS a middle ground, a “don’t index me at the moment” or “Revisit for index on..”, so that people who are about to put a site live can place this tag on pages and people who do not want content indexed can use good ol’ NOINDEX.
Chris Estes Said,
February 24, 2008 @ 11:59 pm
If the noindex is there that means they don’t want it to appear. Any time I use it I don’t want the page to be seen in the indexes. But if that is the case then they should also block it with the robots.txt file. I imagine this can happen by mistake: someone clicked a button in their content management system for development and forgot to turn it back on. Oops, the whole site is taken down. Great. That is an easy thing to do, especially if you are developing in a live environment.
When a site has noindex, don’t show it at all. Or, if Google wants to, they can make phone calls to see if they really want it removed. I would give my number out for a call from the Google stars.
Harith Said,
February 25, 2008 @ 12:08 am
Here I go again (the tag didn’t show in my previous post)
Matt,
Tell us about a middle ground. In the case:
Google does not show that page at all, but it still follows the links on that page.
<meta name="robots" content="noindex, follow" />
rick gregory Said,
February 25, 2008 @ 12:25 am
Yeah, this seems easy if you interpret NOINDEX to mean ‘don’t tell anyone the page exists.’ Semantically it could mean ‘don’t spider the page and cache the content’ while being neutral about whether the search engine can reference the URL in its results.
However, if the historical meaning has been ‘exclude this page from the index entirely’ then that’s what Google should do. The fact that some people forget to remove this from a staging site is THEIR issue. Google should not babysit them and, no, it doesn’t matter how high profile the site is.
Joost de Valk Said,
February 25, 2008 @ 12:59 am
Hey Matt,
very pleased to see you guys addressing this issue, and I hope other search engines will follow suit, resulting in one way all the engines treat noindex. Personally my opinion is that no search engine should display a page when it has noindex on it. Even more, I find the fact that most search engines, including google, seem to automatically assume a follow after the noindex if there’s no explicit nofollow to be very weird.
Try explaining to someone who doesn’t know about this stuff the following: “yes we prevented Google from showing that page in their index”. Two minutes later, going over the backlinks in GWT, client: “hey but there’s that page, you said you excluded it”, me: “yeah but…” a LONG explanation follows.
Furthermore: the removal tool works for 90 days only, which is not good enough in a lot of cases. Preventing a page from being spidered in robots.txt has the downside of you having to put all your precious URL’s in a file that anyone can see, OR having to cloak the robots.txt.
So, in all, I’d like the noindex meta tag and http header to prevent google / all search engines from displaying that page in the search results, and I’d like the default to be to assume a nofollow is present with all noindexes.
The fact that some websites get the noindex wrong by accident is a problem, I can understand, but you don’t solve a problem a minority of websites has by forcing a majority of people to change their ways. You solve that problem by educating the people maintaining those websites.
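The default that Joost objects to can be shown in a few lines of Python. This is only a sketch under stated assumptions: the function name and the `noindex_implies_nofollow` flag are hypothetical, the flag models his proposal rather than any engine’s actual behaviour, and the same directive string could arrive either in the robots meta tag or in the `X-Robots-Tag` HTTP header.

```python
def effective_directives(content, noindex_implies_nofollow=False):
    """Resolve a robots directive string into index/follow booleans.

    content: the comma-separated value of a robots meta tag (or header).
    noindex_implies_nofollow: hypothetical switch for the proposed default,
        where a bare 'noindex' would also mean 'nofollow'.
    """
    tokens = {t.strip().lower() for t in content.split(",") if t.strip()}

    # 'none' is shorthand for 'noindex, nofollow'; 'all' for 'index, follow'.
    index = "noindex" not in tokens and "none" not in tokens

    if "nofollow" in tokens or "none" in tokens:
        follow = False
    elif "follow" in tokens or "all" in tokens:
        follow = True
    else:
        # Nothing explicit about links: fall back to the chosen default.
        follow = index or not noindex_implies_nofollow

    return {"index": index, "follow": follow}


# A bare noindex under the behaviour described above (follow is assumed)...
print(effective_directives("noindex"))
# ...and under the proposed default (nofollow is assumed).
print(effective_directives("noindex", noindex_implies_nofollow=True))
```

An explicit `noindex, follow` resolves the same way under either default, so only pages with a bare `noindex` would change behaviour.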
Maurice Said,
February 25, 2008 @ 1:17 am
I vote no
and simply dropping a noindexed page is simpler to handle than having to build and test new ways of handling noindex and then deploy them onto your infrastructure.
Save the money and spend it on something that will help users/webmasters.
Marek Said,
February 25, 2008 @ 1:24 am
You definitely should not base your decision on an informal poll of your blog readers:-)
I think you should respect the webmaster’s wish and not show the noindexed page at all.
It’s quite simple actually. Your highest duty is to the user; but you must respect the wishes of the webmaster, since you are using his/her content to keep your business alive…
If I posted a sign prohibiting taking pictures on my property, would it be right for a photographer to enter and publish them because his highest duty is to reader?
Chris Hunt Said,
February 25, 2008 @ 1:26 am
NOINDEX should do what it says on the tin - keep that page out of the index. If Microhoo!’s search engines are doing it wrong, I don’t think you should follow them.
Teddie Said,
February 25, 2008 @ 1:29 am
I think it is very important that webmasters have a way of completely removing all evidence of their content’s existence from search engines, and if noindex in conjunction with the standard robots.txt exclusions is the best way to do this then that is good.
Options 2 or 3 where you show a link to the page, begs the question what title and snippet would you display? I have serious reservations about your current treatment of this in regards to robots.txt exclusions, because currently it opens up the possibility of competitors using excluded pages as a way of spoofing content in the SERP. If I can override that with noindex then excellent.
This was my post about it last year:
http://www.search-engine-war.co.uk/2007/04/the_power_of_li.html
but the treatment still seems to be the same, meaning that someone linking into excluded pages could spoof malicious titles into a competitor’s results, which opens up a can of legal worms, particularly for the finance industry.
Indexed URLs / blocked in the robots.txt:
http://www.google.com/search?q=site:adwords.google.com/select/ProfessionalStatus
Double check they are blocked:
http://www.google.com/search?q=site%3Agoogle.com+%22Google+AdWords%3A+Professional+Status%22
Spoofed title selected from third party website, this example is innocuous but the exploit could be used in other ways:
http://www.google.com/search?q=verify+professional+status+at+google
If you were to choose option 3, your only choice is simply to display the URL itself, which I think should also be the policy for robots.txt exclusions as well.
Mick Said,
February 25, 2008 @ 1:37 am
I also agree in principle with EGOL.
It all went wrong for me here “Our highest duty has to be to our users, not to an individual webmaster.”
Of course it does, but time and time again Google takes this standpoint, and guys like us, who are making the sites that Google lists to best serve its users, end up feeling like second class citizens.
If a webmaster adds NOINDEX your duty is to respect their wishes. Of course mistakes can be made and for me how to address that is the answer you should be looking for.
As someone else clearly points out, you have no interest in your duty to serve your end users if you ban a site. How about all those Germans who searched for BMW to find it wasn’t there? Of course they broke guidelines and got the chop.
It’s just an example to point out that your statement “Our highest duty has to be to our users” is used only when it suits.
Dave Owen Said,
February 25, 2008 @ 1:38 am
If it helped some users to find what they were looking for by breaking into my house and browsing through my bookshelf, would that be okay?
The webmaster has clearly and explicitly withheld permission to index the page. It will be a tough task finding any justification good enough to cross that line.
Kralle Said,
February 25, 2008 @ 1:47 am
Webmasters MUST have the possibility to advise robots NOT to index a website. Google actually shows URLs that are forbidden in robots.txt, which is surely not known by a lot of webmasters, so please don’t change the behaviour of noindex, the only effective way to forbid robots to index a website!
Gary Beal Said,
February 25, 2008 @ 2:05 am
With all the different ways (noindex, nofollow, robots.txt, url exclusion, redirects) to do this, it is good to know that one will definitely work, and quickly as well.
GaryTheScubaGuy
Lea de Groot Said,
February 25, 2008 @ 2:10 am
Don’t change what you are currently doing - don’t break the web.
If you really must decide that ‘well-known’ sites will have their noindex statements ignored, then find a better method than using the mediocre DMOZ resource to define ‘well-known’.
But I can’t condone doing so - you’d be like the paparazzi chasing celebrities. They are asked to go away, but they just won’t listen…
Andrew Heenan Said,
February 25, 2008 @ 2:12 am
While I understand Google’s anxiety about accidental removal, NOINDEX (unlike nofollow!) does exactly what it says on the can, and does Google REALLY want to take on a policy of deciding what’s a webmaster mistake, and what’s a webmaster intention?
It is quicker, cleaner than Google removal Tool - I wish all tags were as sensible.
Just re-read your list; what the other engines do actually LOOKS SILLY!
It ain’t broke, it don’t need fixing.
If education is needed, then educate; don’t dumb down.
Andrew Heenan Said,
February 25, 2008 @ 2:17 am
Sorry, I meant to also say:
I fully understand and agree that Google’s first responsibility is to search users; but this isn’t about first duty, it’s simply about respect; respecting a website’s stated desire not to be listed.
Out of interest, why has it suddenly become a hot issue? Is it a ‘reputation’ thing - or a realisation that people in other languages may be getting it wrong disproportionately?
Emmanuel Said,
February 25, 2008 @ 2:33 am
Hello Matt,
I agree with EGOL’s comment… no trespassing
If the webmaster specifies the page not to be indexed, it is clearly because the information is sensitive and/or not to be seen by anybody
g1smd Said,
February 25, 2008 @ 3:34 am
My observations are this…
URL denied by robots.txt:
Google does not access the URL. Google can list the URL as a URL-only result (no snippet) in the SERPs if it sees other pages linking to the URL. Yahoo also lists as a URL-only entry (no snippet) but sometimes goes one step further by crafting a title for the entry by using anchor text found in a link for the title (as long as the title is NOT some sort of generic “click here” or similar). I don’t like that effect.
URL with meta robots noindex:
Search engines fetch the page. They have to do that, in order to then see the meta robots noindex tag. However, Google does still cache the page internally (there was a bug a few years ago where meta noindex URLs in Supplemental index did show in SERPs for a few days under certain conditions), DOES assign PageRank to the URL, DOES follow links out from it, but does NOT show any reference to it in SERPs (except for the brief bug just mentioned). That’s expected behaviour I think.
There’s a bit of a conundrum if I add a “noarchive” tag to a “noindex” meta tag, as you will have to keep pulling the page to remind yourself not to index it, because you haven’t had permission to keep a record of what I said before. I am sure that is easily solved in some way.
From this, it is hopefully apparent that what Google stores internally for later use, and what is shown in the SERPs are two different animals.
I have a question though.
I have seen a small change to the way that the robots.txt disallow directive is handled. The change happened somewhere around, or before, the 2007 October/November time frame. New URLs on a site that are already disallowed by robots.txt, now show up as URL-only entries for a few days and are then dropped. That didn’t use to happen. It appears to me, that maybe they are listed “by mistake” (and I am guessing they are from the Supplemental Index) and then another process “cleans them up” within days. I think this change occurred around the time that Matt talked a lot about “minty fresh indexing”.
Example search:
http://www.google.com/search?num=100&filter=0&q=site:resource-zone.com+-inurl:showthread+-inurl:forumdisplay+-inurl:faq+-inurl:announcement+-inurl:guidelines+-inurl:external+-inurl:archive+-inurl:odp
The above search continues to show some URL-only entries in Google SERPs. These are all for URLs denied access by robots.txt and most are for new users or for new threads from the last few days (and which get dropped within a week or so), but there are always 30 to 60 URLs in this result.
The question is this:
If access is denied by robots.txt, and the page isn’t fetched by Google, how is Google able to judge that a page is not in English, and adds the translate tag to the URL-only entries for pages that are not in English?
Example:
www .resource-zone.com/forum/member.php?u=62939 - [Translate this page] Similar pages
This URL was “created” only a few days ago. The URL matches a robots.txt disallow pattern that has been in force for at least 18 months. I assume that Google has not fetched the URL, as it is robots.txt disallowed.
Is that content language guessing done on the basis of the language of the pages that link to this URL? If so, then that is a bit of a leap of faith, but I haven’t seen it guess wrongly yet.
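The difference g1smd describes between a robots.txt disallow and a meta robots noindex can be sketched with Python’s standard `urllib.robotparser`. The bot name, the `example.com` URLs, and the outcome strings are all assumptions made up for illustration. The sketch only mirrors the observations in this comment and does not claim to describe Google’s pipeline.

```python
from urllib.robotparser import RobotFileParser


def crawl_outcome(robots_txt, url, page_has_noindex, user_agent="ExampleBot"):
    """Describe what a well-behaved crawler could know about a URL.

    robots_txt: the text of the site's robots.txt file.
    page_has_noindex: whether the page would carry a meta robots noindex tag
        (only discoverable if the page is actually fetched).
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    if not rules.can_fetch(user_agent, url):
        # The page is never requested, so any meta tag on it stays unseen.
        # Links from other pages can still reveal that the URL exists.
        return "not fetched: may appear as a URL-only entry"
    if page_has_noindex:
        # The page had to be fetched for the tag to be read at all.
        return "fetched: left out of the results"
    return "fetched: eligible for a normal listing"


robots_txt = "User-agent: *\nDisallow: /forum/member.php\n"

print(crawl_outcome(robots_txt, "http://www.example.com/forum/member.php?u=1", True))
print(crawl_outcome(robots_txt, "http://www.example.com/forum/index.php", True))
print(crawl_outcome(robots_txt, "http://www.example.com/forum/index.php", False))
```

The first case is the interesting one: the noindex tag on a disallowed page has no effect, because the crawler is never allowed to read it.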
Errioxa Said,
February 25, 2008 @ 3:42 am
If you disallow the page in robots.txt, then Google doesn’t read this meta tag, and Google should not show such a result.
Why use noindex if you can use robots.txt?
Isn’t this option redundant?
You could also use noindex as a second chance; it’s a good option.
g1smd Said,
February 25, 2008 @ 3:49 am
Hmm. It is a deeper issue than I originally thought.
I have just realised that the content for some of the URLs with an additional [translate] tag is only viewable to people who have logged in to the site. If you aren’t logged in, you get a “please log in” error message instead. So, even if Google had fetched the pages in violation of the robots.txt disallow, they still would not have seen the foreign language content either.
So, how is the “translate” results page able to tell me that the page was originally in Chinese, or Italian then?
What have I missed?
quadszilla Said,
February 25, 2008 @ 4:07 am
Google should show a link that pretends to go to the site but instead goes to sponsor results from adsense.
More money for google that way.
Ezra Goldschlager Said,
February 25, 2008 @ 4:09 am
Matt,
In response to:
“When a user does a navigational query and we don’t return the right link because of a NOINDEX tag, it hurts the user experience (plus it looks like a Google issue).”
Although I think the engineers in this camp have a good point with respect to the “hurting the user experience” side, I’m not sure that it “looking like a Google issue” is a reasonable criterion. When something “looks like a Google issue”, I believe this doesn’t really have much to do with user experience (save for those users who might waste some time contacting Google to tell them about the “error” — does this happen?) or webmaster interests.
Ultimately, I think you have to assume that the vast majority of instances of NOINDEX will be legitimate, and intentionally implemented. If a webmaster doesn’t want a page indexed, it’s probably very frequently precisely for user-experience reasons. Perhaps the webmaster doesn’t want a user fumbling around on an uncompleted site, or doesn’t want a user ending up at a page that is likely of little to no use to him or her if he or she ended up there based on a Google search.
Further, for the vast majority of users, does it _really_ improve experience to get a blank “reference only” result? Or does this just add to confusion?
When I started writing this post, I didn’t have a strong opinion, but I think I’ve just convinced myself that Google should leave things as-is.
Nate Wood Said,
February 25, 2008 @ 4:18 am
I understand that Google wants to seem to have all the websites out there indexed, but NOINDEX means exactly that. I think you should respect it.
Why don’t you run an internal project, and if you’re getting a lot of NOINDEX pages from .ac, .edu, .gov, .mil domains during crawling, then simply send the webmaster an email asking if they intended to stay out of the Google index by using the noindex tag. If no, the email will serve to remind them to remove the noindex. If yes, then you have your answer and can behave respectfully to the webmaster by following their wishes. I’m sure this can all be automated too, which is true to the Google way.
Michael Brandon Said,
February 25, 2008 @ 4:19 am
Don’t show the link at all. Who is to say what is an “important” website? Does an important Government website that has noindex on some login only pages have those pages listed when it has specifically asked for noindex? If I have my cms admin/subscriber only pages hidden, nofollowed and noindexed etc, then I certainly do not want those pages to get indexed in any way even if there is password only access to them. If some random admin/mod links to them, I certainly do not want such pages showing at all, despite the inbound links.
I assume that such inbound links would be listed on the webmaster tools - make it that little bit easier to find and get rid of them.
I understand the argument about the Korean websites you mention. But I consider it a webmaster/SEO job to right the issue. If there are no results, anyone worth their salt can find the issue and correct it - if that is what they want. I have found and corrected many issues, even 404’s, and 500 headers being shown for all pages of cms driven websites. The line of webmaster responsibility has to be drawn somewhere, and I consider the purposeful inclusion of a tag should be respected.
I object to such an issue being even debated while Google is effectively no-indexing sites due to the duplicate snippet issue. Yet another client complained of substantially lower rankings today, and a simple search found websites that had copied the content. A previous client’s issue, on which no comment was received - http://www.mattcutts.com/blog/duplicate-content-question/#comment-122817
Andy Beard Said,
February 25, 2008 @ 4:39 am
@Matt I hope you leave it as is - sure I see situations when it is added by mistake, though not always by a blog owner - there was that major problem with many blogspot blogs last year for instance, and Wordpress by making it easy for people to add this also should make it clear that it is switched on.
@g1smd
Google adds titles to URLs blocked by robots.txt based upon anchor text as well.
For an example search for Wordpress SEO, and you will see the paid review I recently blocked with robots.txt with anchor text included.
Also maybe the RSS feed is being picked up, if not by google then by someone else picking up the feed and Google indexing that.
http://www.resource-zone.com/forum/external.php?type=RSS2
You can certainly tell the language by the snippets in the RSS
Philipp Lenssen Said,
February 25, 2008 @ 5:13 am
I think it’s a pretty clearly defined case: webmasters put “noindex” in their page because they don’t want the page indexed or shown. As you can’t know whether a webmaster perhaps accidentally put the noindex there, you have to err on the safe side and do what you’re told. Or how would you feel if people started to interpret Google’s terms of service in terms of, “oh maybe their lawyers just misspelled this and really mean something else, I’ll ignore it.”
Also, please do not try to push webmasters to always use a Google tool, like a URL removal tool, to do stuff; while Google search is close to a monopoly there are still other engines out there, and webmasters have better things to do than toggle a dozen tools’ configurations.
As far as a middle ground goes, I kind of like Egol’s suggestion to include a notice in the style of “some pages aren’t shown because they use a ‘noindex’ tag”, similar to how you disclose it when you censor e.g. the Human Rights Watch organization in China.
Tim Linden Said,
February 25, 2008 @ 5:44 am
I’m shocked this is even a question.. Is a 404 page next?
As for the comment that it makes Google look bad.. No, it doesn’t. It would if Yahoo listed the page and Google didn’t, but since NOBODY is indexing the page it’s meant to not be found.
jasonC Said,
February 25, 2008 @ 5:46 am
Like EGOL I say respect the noindex tag; it’s a no trespassing sign. The people that OWN the site placed it or had it placed there for a reason. This is the internet version of a fence or being placed on the do not call list. Violation of either can get you fines and/or jail time. While the end user is Google’s ultimate perspective and focus, the OWNER’S implicit statement of noindex means stay out and must be respected as a willful and knowledgeable decision on their site’s privacy. If any search engine is willing to violate a site owner’s privacy then violating the end user’s privacy is sure to follow.
On a mass scale google (or any SE for that matter) cannot determine the reason a webmaster put the noindex tag there. With all the attention and importance being placed on privacy (see “E-mail (required, never displayed)” on your own comment form) google has the duty to respect any and all noindex and nofollow tags. They’re there for a reason- excluding mistakes. If they put the tag there by mistake they’ll know it soon enough and will correct it.
All search engines have to keep in mind that people will put up sites that are private and should not be available to the general public. I know of many sites and message boards that are populated only with family related information and they HATE anybody being on the boards that are not family. The information is private and should be kept that way. The thought that google is toying with the idea of intentionally violating that privacy is scary and wrong.
If the noindex tag and the privacy it affords is not honored what’s next, indexing someones pc to improve the end user search results like microsoft seems to be considering. (I’m in the process of moving to linux because of that statement alone)
As a general policy all search engines should respect the noindex tag, PERIOD. If it’s a large site and you think they did it by mistake send them an email to inform them of it. That way the onus is on them not google.
Dito Said,
February 25, 2008 @ 5:46 am
Matt,
If you choose the middle road, you may want to consider using a different metric than a listing in the ODP. (possible alternatives: PageRank comes to mind, comScore, Compete).
As much as I want my site listed there, having to possibly wait for over two years for a listing, and even then no sure thing, the directory just isn’t a reliable source on the web. It’s dated on any given day. It’s the internet 2 years ago (ok, that’s a stretch).
I personally like the idea of NOINDEX not indexing at all.
Thanks again for more great information.
Jim McNelis
Teddie Said,
February 25, 2008 @ 5:53 am
@g1smd
>> Google does not access the URL. Google can list the URL as a URL-only result (no snippet) in the SERPs if it sees other pages linking to the URL.
Actually Google behaves similarly to Yahoo!: if the search text matches some link text pointing at a blocked URL, Google uses the link text as a title. I think this is very bad practice. See my examples higher up this page.
>> If access is denied by robots.txt, and the page isn’t fetched by Google, how is Google able to judge that a page is not in English, and adds the translate tag to the URL-only entries for pages that are not in English?
I also spotted this on some of the blocked results for the adwords professional status URLs coming from Google’s own website but overlooked it. In the SERP the [ Translate this page ] link assumes there is Dutch content on some of the pages; however, it doesn’t work on all the foreign language URLs, and it missed quite a few Swedish and Spanish ones. Some of the professional status URLs contain the hl= interface language parameter but not all of them do, so my guess is it looks at language from the page or link text pointing to the blocked page.
Ryan Said,
February 25, 2008 @ 5:59 am
I’m confused why there’s even a discussion. It seems like it should tell robots not to index just like its name says. Same idea as excluding by robots.txt.
The only time I really use this tag is when I’ve got multiple URIs to the same page. For example, some ecommerce sites (osc) put the path to the item as part of the query string and that can lead to duplicates. The noindex tag is the easiest way I’ve found to keep certain variants of that from being indexed.
And I understand the argument of user experience, but shouldn’t it be up to the page author to decide if their content is included? If I don’t want a page, or a whole site, included in search engine indexes that should be my decision, not the search engines.
Sebastian Said,
February 25, 2008 @ 6:04 am
Hey Matt,
Thanks for asking! Of course Google has to respect NOINDEX as is. The sole question is how to enhance navigational SERPs when the best result is NOINDEX’ed. A message like “the best matching result is unfortunately blocked by the site” would suffice IMO.
If you really can’t live with that, then support a robots.txt directive like
NOINDEX=NOREFERENCE: /
respectively
NOINDEX=REFERENCE: /
so that Webmasters can decide whether or not they allow you to mention references to forbidden stuff on your SERPs.
I’ve written a longish pamphlet on this topic today, discussing other possible solutions too:
@ALL: Give Google your feedback on NOINDEX, but read this pamphlet beforehand!
I really hope you’ll do the right thing.
Thanks
Sebastian
Sebastian Said,
February 25, 2008 @ 6:09 am
Hey Joost,
Google’s handling of the implicit FOLLOW is perfectly in line with the standards, and you really do not want Google to change that. Think of internal hubs, site maps and such on large sites where you can’t output a gazillion links in human readable format for example. NOINDEX,FOLLOW is a perfect way to serve bot maps without bothering/annoying visitors/searchers, and there are many other use cases too.
Cheers
Sebastian
Matt Said,
February 25, 2008 @ 6:25 am
Why complicate this? The tag says NOINDEX. If you want your page indexed don’t add the tag. If you don’t want the page indexed it shouldn’t be.
David Cooley Said,
February 25, 2008 @ 6:26 am
Google already has this one right, why change that?
You should not change things to cover a few exceptions.
joshua Said,
February 25, 2008 @ 6:26 am
While on the topic of NOIndex: I had a blog whose index page was set with NOIndex, as I wanted only real pages indexed, not the summary index page. However that noindex setting prevented blogsearch.google.com from indexing my feed. I confirmed with a few other blogs as a test and indeed once we removed noindex from our home page, blogsearch operated as expected and almost immediately.
Nate Wood Said,
February 25, 2008 @ 6:32 am
Actually, I’ve been thinking about this a lot this morning: It’s GOOGLE’S product, and Google is free to do with that product what they wish, as you keep telling us about paid links.
I do like that you ask webmasters’ opinions Matt, I really do. I respect that. But it seems like a one way street unfortunately. Webmasters are using tools and coming up with ideas to make the Google product better (in the hope that somehow it’ll remove the crap above them in the SERPs), but there doesn’t seem to be a lot of love going the other way, huh? We even create content to make it easier for Google to index and have better quality SERPs. However, if we play completely by Google’s rules, do we get a ranking benefit? Nope. If we step outside of the lines, do we get penalised? That’s the threat, although I never see any of my finance competitors getting busted over link schemes like DPA.
So, I’m sort of led to ask - what’s in it for the webmaster exactly to keep tweaking their settings in order to allow Google to have a better product offering? We always hear the “making the internet a cleaner place for users” argument, but webmaster suggestions and efforts help to make Google a multibillion dollar industry. If webmasters all refused to set their preferred canonical domain, or put nofollows, noindexes, good robots docs, etc, would the Google SERPs be of the same quality? Would the very obvious spam that exists in the Google index be gone? Would you be prepared to give back to those people that help Google out by adhering to your rules?
Tracy Hurley Said,
February 25, 2008 @ 6:35 am
I think NOINDEX should continue to work as it currently does. However, it might make sense to add a third possibility, LIST, which would not analyze the page but would allow the url to be listed in Google similar to the way Microsoft and Yahoo currently list NOINDEX sites and the way Google lists sites where it has found links but has not yet crawled the site.
Beyond that, if Google Webmaster Tools is not currently doing so, it should have a check for the NOINDEX tag and alert the webmaster to its existence. While not everyone who might accidentally add the NOINDEX tag uses Webmaster Tools, you can help at least some of them. Google Adsense and Analytics would also be great places to add checks and warnings, as I’m pretty sure far more people use one of those on a regular basis than Webmaster Tools.
The middle ground of using Open Directory sounds like an interesting idea but I am concerned that a site could get submitted to it and the website owner really meant NOINDEX. I know that this doesn’t help Google with its customers but I don’t think the customer’s wishes should supersede that of the probable copyright holder.
Like other commenters, I don’t believe using the url removal tool is a realistic or reasonable alternative to the NOINDEX directive. The tool only works for Google and is not permanent. In addition, the website owner needs to use Google’s Webmaster Tools to reinclude content. Webmasters should have the ability to exclude and include content through the use of meta tags and should be able to make that information available to all search engines, not just Google.
Stefan Said,
February 25, 2008 @ 7:06 am
It’s obvious to me: NOINDEX means clearly “Don’t index this page.” aka “It has to stay out of any index.” If a page is labeled as NOINDEX, it should never show up for a query, no matter what.
JLH Said,
February 25, 2008 @ 7:10 am
My vote is that it should not index anything, as advertised. Maybe a noindexunlessinthedmoz is needed or noindexunlesslinkedto etc, but short of that the tag should work the way it is named.
As far as Google looking bad, I’m not so sure that is a real concern. Plenty of sites get removed all the time not for quality issues but for spamming issues that are completely invisible to the average user, yet the average user must wonder where the site is; sites that are not fully indexed use the site search function with missing pages giving an incomplete search result; daily, people don’t understand why Google of all engines cannot find their links in the GWHG; PageRank is shown as the “importance” of a page yet only the SEO community knows that it is rarely updated and often wrong; the last crawl date in GWT is often quite wrong. So having supposed inaccuracies is not a new problem at Google; as a matter of fact it’s even designed into the algo (ex. Link:) and those don’t carry a warning like, “Google has purposely withheld showing all of the links pointing to this site”
If you want to tackle confusion on webmasters’ part rather than mess with a directive which seems pretty clear to me, noindex == don’t index, I’d take a look at the “none”. I’ve seen far more people accidentally using this one thinking that it means “no restriction” rather than its actual meaning of “noindex,nofollow” as seen in this blogpost by Vanessa:
http://googlewebmastercentral.blogspot.com/2007/03/using-robots-meta-tag.html
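To make the “none” confusion concrete, here is a minimal sketch in Python of how a crawler might expand the content value of a robots meta tag. The function name parse_robots_content is hypothetical, and the defaults (index and follow when nothing is specified) are an assumption based on the behavior described in this thread, not any search engine’s actual code.

```python
def parse_robots_content(content):
    """Expand a robots meta `content` value into (index, follow) booleans.

    Hypothetical helper: assumes the default is to index and follow,
    and that "none" is shorthand for "noindex,nofollow".
    """
    tokens = {t.strip().lower() for t in content.split(",") if t.strip()}
    index, follow = True, True  # assumed defaults when nothing is specified
    if "none" in tokens:
        # "none" does NOT mean "no restriction"; it turns both off
        index, follow = False, False
    if "noindex" in tokens:
        index = False
    if "nofollow" in tokens:
        follow = False
    return index, follow


print(parse_robots_content("none"))            # (False, False)
print(parse_robots_content("noindex,follow"))  # (False, True)
print(parse_robots_content("all"))             # (True, True)
```

Under these assumptions, a page tagged “none” is treated exactly like one tagged “noindex,nofollow”, which is the opposite of what the webmasters mentioned above expected.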
Maurice Said,
February 25, 2008 @ 7:19 am
completely off topic
Your guys seem to be having fun in the UK insurance market - some of the big boys, Go compare and Kwickfit, seem to have been naughty boys and got penalised
Dr. Cushty Said,
February 25, 2008 @ 7:41 am
“If high-profile sites like
- http://www.police.go.kr/main/index.do (the National Police Agency of Korea)
- http://www.nmc.go.kr/ (the National Medical Center of Korea)
- http://www.yonsei.ac.kr/ (Yonsei University)
aren’t showing up in Google because of the NOINDEX meta tag, that’s bad for users.”
Granted, the user finding what they need is important. But if a site has deliberately and clearly told Google they do not wish to be indexed - ESPECIALLY big, high profile sites which have obviously made a decision in this regard - then that is their prerogative, regardless of anything else. I think it’s really dumb to do so, but people have the right to be dumb and not to have their wishes ignored. As many previous commenters have said, it’s like ignoring a no trespassing sign.
To me, NOINDEX == do not index this page, at all. Simple as that.
William Said,
February 25, 2008 @ 7:43 am
I vote don’t show the page but allow it to be indexed, as you do currently. This is handy for people that don’t want to have the website indexed (can put it on every page) but want to check the sitemaps files, 404 errors etc that the webmaster tools offers, which the robots.txt exclusion does not allow.
I basically use a stage domain with robots noindex, nofollow on every page, add the sitemaps to Google Webmaster Tools and then you can see if there is any sitemaps errors, 404 errors etc.
Russell Jones Said,
February 25, 2008 @ 7:44 am
NOINDEX not only means do not show my sites in the search results, it means do not include my site in an INDEX at all.
I have some brothers at law firms that would love to start the class action suit on behalf of webmasters when Google changes its policy on NOINDEX.
Walter Wimberly Said,
February 25, 2008 @ 7:48 am
I vote that the page should not be seen, as requested by the web master. This has been around for many years. The web master is the owner of a site, and this is (as someone mentioned) like putting a “No Trespassing” sign on the front lawn. Reasonable, lawful people are expected to obey it - and it is within a normal expectancy to believe that the search engines are reasonable, rule-following on-line citizens.
Likewise, I don’t think it should be completely removed from Google’s internal indexing; however, it should just not be shown. Leaving it in an internal index, flagged as NOINDEX, would allow Google to know about it when it gets a link from another site, and to not bother to show it. Likewise the internal referencing might allow the page to be rechecked, although not as frequently, to see if the NOINDEX was removed. (This allows other issues, like web master mistakes, to be healed naturally over time.)
I would say a regular user has no real idea how a search engine works. They will often expect that as soon as a page is uploaded to the Internet all of the search engines know about it. Likewise, they have a love/hate relationship because it doesn’t give them the best result the first time they search on the first item on the page. Most users don’t understand detailed queries in a search engine, or using modifiers like the quote or minus sign. So how can we expect them to understand that someone doesn’t want a site/page indexed?
LW Said,
February 25, 2008 @ 8:02 am
I think that a no-index tag should do just that. I don’t feel it should be changed to help out sites that don’t really know or can’t figure out what they’re doing wrong - like the Korean sites you mentioned. Who is to judge whether a site is important enough to have it re-evaluated?
Don Macaskill Said,
February 25, 2008 @ 8:07 am
Hey Matt,
Lots and lots of SmugMug’s customers have our privacy controls enabled. One of them basically NOINDEX’s (and NOSNIPPET, NOARCHIVE, etc) all their pages.
They’re hoping (rightfully so, since it’s always worked this way) that that’ll get rid of all traces of their SmugMug account from Google’s index. They don’t want search engines to find them at all.
If you change your behavior, it won’t just be webmasters you’ll be hurting - it’ll be end users, too, who have an expectation of controllable privacy.
rob Said,
February 25, 2008 @ 8:12 am
I did actually vote on the somewhere in the middle option but having thought about it a little more I think that status quo is pretty good too.
What I’d suggest is that Google look at advising webmasters in the different ways in which it can be used with or without the follow directives and things like that.
A url removal tool.
A don’t ever index this page tool.
A don’t index but follow the links tool.
A don’t index and don’t follow the links tool.
The more options the better all around really.
Aaron Said,
February 25, 2008 @ 8:31 am
Hi Matt -
I think that Google should completely drop the page. With that being said, I like the idea of putting a warning/error in the Webmaster Tools that shows which pages are noindexed. If the homepage is noindexed, then the error can be prominently displayed in the overview for that domain in the Webmaster Tools. When a user sees that, they can dismiss the warning or keep it there until it’s fixed.
-Aaron
Silver Said,
February 25, 2008 @ 8:34 am
Matt, I naturally would prefer not to have meta Noindexed pages showing up in the results (as I voted).
But, it’d also be helpful IMHO if you crawled links on NOINDEXed pages, if they specify Follow. In the past, it appeared to me that NOINDEX,FOLLOW might not be actually followed. Or, if they are followed, the PR doesn’t seem to flow through the NOINDEXed page, so the followed pages might as well not be indexed, either…
This would be helpful to us when constructing sites where we need to link up large amounts of pages, though the intermediary branching pages are not worthwhile to have indexed. I think this would help in terms of quality, since the pages would be of low worth, while the destination content pages they link to may be entirely worthwhile.
Matt Cutts Said,
February 25, 2008 @ 8:47 am
“Is there any way to block the bots from indexing a part of a page?”
Pratheep, Yahoo has proposed something: http://www.ysearchblog.com/archives/000444.html . However, the last time we checked, less than a thousand sites were using those special tags. Given the small number of sites involved vs. doing other things with engineering resources, so far we haven’t added support for Yahoo’s tags.
However, we do offer something for AdSense publishers called section targeting: https://www.google.com/adsense/support/bin/answer.py?hl=en&answer=23168 . That allows an AdSense publisher to say “only target ads to this section of text.” Most AdSense publishers could improve the relevance/CTR of their AdSense ads by using those tags well.
Harith, I just got to your follow-up comment. In that case, currently Google wouldn’t show the page but would follow the outgoing links.
Matt Cutts Said,
February 25, 2008 @ 8:57 am
“Out of interest, why has it suddenly become a hot issue? Is it a ‘reputation’ thing - or a realisation that people in other languages may be getting it wrong disproportionately?”
Andrew Heenan, it’s not a hot issue, but every month or so we get a small trickle of sites that are shooting themselves in the foot. “Why didn’t Ben Harper’s site show up? Did bmw.dk remove itself on purpose? These Korean sites have dropped from our index.” Every time we see a site that appears to have made a mistake, it re-opens the conversation.
“As for the comment that it makes Google look bad.. No, it doesn’t. It would if Yahoo listed the page and Google didn’t, but since NOBODY is indexing the page it’s meant to not be found.”
Tim Linden, part of the issue is that Yahoo and MSN do return links to NOINDEX’ed pages, so people sometimes think that it’s a Google issue.
Don Macaskill, thanks for weighing in with a good point about SmugMug. I had thought of the case of parked domains, but not privacy controls on profiles.
Olson Said,
February 25, 2008 @ 8:59 am
If it has the tag, don’t index it. However, if you want to be nice to webmasters who have somehow put it in their code without knowing, why not send them an email telling them to check if they don’t want indexing?
Matt Cutts Said,
February 25, 2008 @ 9:00 am
Everyone, I really appreciate the thoughtful comments here. There is a fundamental tension in this case between an easy-to-use tool for site owners vs. more information for users. I agree with many people here that there should be an easy way to remove all traces of a site from Google. I do however also feel the pain of search ranking people at Google who see a navigational query for e.g. a well-known government website and we might not return the answer that the user was looking for because of the NOINDEX meta tag.
Johan Said,
February 25, 2008 @ 9:13 am
NOINDEX means no index. Why confuse the matter with gray areas.
teddie Said,
February 25, 2008 @ 9:14 am
Surely that’s an issue with education of government webmasters? It’s exactly the same in the UK, maybe worse because the Government hands out all these red herring eGov meta data standards for them to obey, which puts them out of touch with the actual public user experience of their websites.
Johan Said,
February 25, 2008 @ 9:22 am
As Teddie said it is indeed an education issue with a minority of sites. I would back down and keep the domain name or homepage indexed in this middle ground mode, but keep all other pages out.
OnlyMe Said,
February 25, 2008 @ 9:24 am
Wouldn’t it be easier and more consistent, if a page or a website should not be indexed, to just use the robots.txt file? Rather than hitting all the pages that have NOINDEX, they can easily be removed from one file.
I’ve come across many sites that are not indexed when they should have been, only to find out that they have a NOINDEX tag on every page. Lately I’ve been finding them in the middle of the coded page rather than at the top where it should be.
It’s just common sense.
Andy Beard Said,
February 25, 2008 @ 9:37 am
Matt, many of these large sites the webmaster might possibly be signed up with webmaster tools, or you have access to the whois record.
Wouldn’t it be easy to send some kind of automated friendly email to a site owner warning about a significant change in your ability to index the site?
It is something I always attempt to do when I see a friend’s site that for some reason is noindexed, just in case they made a mistake, and that usually is the case.
Rob Said,
February 25, 2008 @ 9:44 am
Like said before in the comments: “NOINDEX means no index. Why confuse the matter with gray areas?” It is a computer command, and should do what it is programmed for. Google does have the right to ignore it though, and make their own rules on how they read a website.
Consistency often equals quality. If Google is not consistent with their method of handling items like this, the lack of quality will quickly cripple the high horse.
ark Said,
February 25, 2008 @ 9:44 am
It’s currently very difficult to keep a site off Google, please don’t make it harder. NOINDEX should not show up in results at all, no matter how many people link to it. All you need is one slashdot story and suddenly lots of people are linking to some confidential information and now they can find it with Google.
I have a site I want private. I had NOINDEX on every page and I had a very restrictive robots.txt (told everyone to go away). Well, the site still showed up on Google. They said since my robots.txt told them to go away they couldn’t find my NOINDEX directives, so they displayed the site links even though they didn’t show any snippets. That was when I learned about the secret noindex robots.txt rule, just for Google (how annoying!)
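The interaction described here (a robots.txt block hiding the NOINDEX tag from the crawler) can be sketched with Python’s standard urllib.robotparser. This is only an illustration of the logic, assuming a crawler that obeys robots.txt before fetching; the helper name can_see_meta_tag and the example.com URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def can_see_meta_tag(robots_txt, url, agent="Googlebot"):
    """Return True if a robots.txt-obeying crawler may fetch `url`.

    A NOINDEX meta tag lives inside the page, so it can only be
    noticed when the page itself may be fetched.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# "Told everyone to go away": the page is never fetched, so the
# NOINDEX tag inside it is never read.
go_away = "User-agent: *\nDisallow: /\n"
# An empty Disallow allows crawling, so the tag can be read and honored.
allow_all = "User-agent: *\nDisallow:\n"

page = "http://example.com/private.html"
print(can_see_meta_tag(go_away, page))    # False
print(can_see_meta_tag(allow_all, page))  # True
```

In other words, for the meta tag to take effect the page has to stay crawlable; blocking the crawl and asking for NOINDEX on the page work against each other.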
Please help us to keep some of our content out of Google, we should have control over that in a standard way.
and don’t even get me started on there being no easy official supported way to do this with rss feeds (not that feed search engines seem to honor anyway)!
bd_ Said,
February 25, 2008 @ 9:50 am
ark, if you want a site to be truly private, the best way is to just set HTTP basic authentication on it. If it showed up in google at all, then that means there’s a link to it somewhere on the public internet; if google can stumble across it, so can others.
g1smd Said,
February 25, 2008 @ 9:52 am
@ark - see my first post in this thread.
Bambarbia Kirkudu Said,
February 25, 2008 @ 10:15 am
NOINDEX:
- Simply… NO INDEX! (a page)
- Subsequently… If page is not indexed, it is not searchable.
- But… Is Page URL searchable?
So, returning to older discussion:
- “Page” (in search engine internals) == “Content” + “URL” + “HTTP Headers”, “MD5 Hash”, may be more.
“NOINDEX”: do not index content! But you can still GET url, PARSE page, RETRIEVE outlinks, and etc. You can even index page URL… but it’s your choice.
Sebastian Said,
February 25, 2008 @ 10:17 am
Ark, instead of using undocumented robots.txt syntax you could give conditional HTTP response codes a try.
Bambarbia Kirkudu Said,
February 25, 2008 @ 10:21 am
BTW, Yahoo proposed HTTP extensions (instead of META tags).
Such extensions allow NOINDEX without actually fetching a page (by using HEAD request from SE).
What about Google?
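As a rough illustration of the header-based approach mentioned above, the following Python sketch checks response headers for a noindex directive. The comment does not name the header; X-Robots-Tag is assumed here, and the function name header_forbids_indexing is hypothetical. The headers are passed in as a plain dict, so the sketch makes no network request.

```python
def header_forbids_indexing(headers):
    """Return True if the (assumed) X-Robots-Tag header asks for no indexing.

    `headers` is a dict of response headers, e.g. from a HEAD request.
    Header names are compared case-insensitively.
    """
    value = ""
    for name, header_value in headers.items():
        if name.lower() == "x-robots-tag":
            value = header_value
    tokens = {t.strip().lower() for t in value.split(",")}
    return "noindex" in tokens or "none" in tokens


print(header_forbids_indexing({"X-Robots-Tag": "noindex, nofollow"}))  # True
print(header_forbids_indexing({"Content-Type": "text/html"}))          # False
```

Because the directive travels in the headers, a crawler could in principle learn it from a HEAD response without downloading the page body, which is the advantage the comment points to.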
Ramoney Said,
February 25, 2008 @ 10:23 am
NOINDEX from my point of view is self-explanatory - that is: do not index!
Though Google does not show my NOINDEX pages in the SERPS, I still wonder why terms from those pages show up in Google Webmaster Tools > Statistics > What Googlebot sees > In your site’s content. I said NOINDEX!
Morristown Said,
February 25, 2008 @ 10:36 am
Hi Matt,
I was wondering if you will be writing soon about your take on the newest activity being promoted by a group of blog and website marketers. The specific new thing they are promoting is an extension of the do-follow blogs. Here is the twist. They are promoting a do-follow / no-follow plugin. This will allow them to trade blog comments with each other and, in their words, link build, but put No-Follow tags on any of their competitors who choose to comment on their blog posts.
Joe Hunkins Said,
February 25, 2008 @ 10:41 am
Hi Matt -
People here want NOINDEX to remove the pages because they understand the SEO implications and NOINDEX is a very helpful feature when used properly. However your Korean example indicates there is confusion out there which means some sites are using NOINDEX without realizing the implications, which are severe for sites and users who want to be found/find things.
IMHO the solution is the same as for other challenges with Google ranking issues - much more robust automated communication with sites. Both via Webmaster Console and an email to the whois contact Google should have a “Webmaster Alert” system that sends out basic obvious information about the status of a site that is not indexed, is perceived to be in violation of guidelines (which is often unintentional or due to confusion), or is downranked as part of the infamous algorithmic mysteriousness.
This scalable, human-free approach would save the webmaster community *millions* of hours each year trying to diagnose problems to please Google. It would also help Google because (again just IMO) the lack of broad based, effective communication with webmasters has more potential to hurt the company than is currently realized internally.
e.p.c. Said,
February 25, 2008 @ 10:42 am
In the pre–Google days, we used NOINDEX,FOLLOW at ibm.com because we found search engines were ranking pages that listed products higher than the intended product page. You couldn’t use robots.txt because you wanted to mark “/products/” as non-indexed, but allow for content under “/products/” to be indexed (eg “/products/thinkpad/”). I think it’s still valid in a pagerank–influenced era. If “important” sites are missing from search results because they’ve chosen to add NOINDEX, then they’re shooting themselves in the foot, but you can’t really know that you’re improving UX by surfacing those results.
Rob Said,
February 25, 2008 @ 10:54 am
Relevancy in SERP is the main question. How can you better produce the results users are looking for? That should be the fundamental question. Lately, I have noticed some great changes in SERP for queries; some results are good, some are bad. It behooves me to ponder on this quagmire that I, as both a user and a webmaster, have to seriously analyze the data that I find online, despite all these proclamations of how advanced the technology has become. I sometimes have to go past 5 or 6 pages to find simple topic-specific results in Google. Until the search engine is perfect in assessing human behavior online to be able to serve customized SERP to each individual, we have to assume that the form of AI is inept in understanding complex human behavior, hence the machine should be taught, and we should give it specific instruction. 0 is OFF, 1 is ON.
Farhad Said,
February 25, 2008 @ 10:57 am
Show a link, so that the users can still navigate to the page, but don’t index it or cache it.
After all, the tag is “noINDEX”, not “novisitors”!
Peter (IMC) Said,
February 25, 2008 @ 10:58 am
I voted for “Show a link to the page “. The reason is that the site owner doesn’t own the links in other sites. So Google has the right to index the links. Links are in the end, content too.
Of course you do have the issue that the content of the page could be unrelated to what the link suggest. That can even get you in trouble in case people (the spamming type) try to abuse this logic.
Therefore Google has to check the page with the noindex tag. Noindex means: don’t index this page. But it doesn’t mean: Don’t visit this page. So Google is allowed to get the content of the page to determine a relevancy score for the link that’s pointing at it. After that it can just forget about the page again.
I guess it’s like “Howard Hughes” You couldn’t “index” the man, but that didn’t mean you weren’t allowed to talk about him.
Multi-Worded Adam Said,
February 25, 2008 @ 10:58 am
I look at it like this, and I know I’m not the only one…the problem isn’t the attribute, but the underlying logic behind it.
NOINDEX is an “opt-out” attribute…which is fine if there’s an “opt-in” clause in the first place. The very logic behind search engines crawling and indexing sites (and for that matter, other bots) is fundamentally flawed in that it’s done based on a single hyperlink, and not always with the prior approval of a webmaster. There are cases out there where sites can’t get into search engines because they don’t have at least one inbound link from a page that is crawled frequently; and there are those sites that have appeared in SERPs by accident (the example that comes immediately to mind is when a webmaster asks a question about a site under construction to try and get some help.)
In other words, take NOINDEX a step further, and don’t index anything before a webmaster/SEO uses the Webmaster Tools to sign up, say “I want this site indexed, but only Pages A, B, and C, not X, Y, and Z”. In other words, use the technologies that you have in place to create an opt-in environment and allow users to indicate what can and can’t be indexed…not the present “absence of a ‘no’ is a ‘yes’ answer.”
You’ve got the tools.
You’ve got the technology.
You’ve got the database size and algorithms to be able to endure the short-term hit that would happen.
And most importantly, people wouldn’t be able to hide as well…the ones that have something to hide, that is.
GIT ‘R’ DONE!
Blackbeard Said,
February 25, 2008 @ 11:05 am
I think that NOINDEX works fine the way it is. I don’t like the idea of google making links to the noindex’ed pages mostly because it can serve useful purposes.
For example, say you wanted to test a marketing campaign and you set a url variable to track that. (ex: http://foo.com/index.php?tracker=google_adwords) Now, if the page(index.php) already is indexed, that would show up as duplicate content. However, if you attach a NOINDEX to index.php if the tracker variable is set, then you don’t have dupe content issues anymore.
In the example above, if the Google index had links to the noindex’ed pages, it could create potentially hundreds, or thousands of duplicate pages that Google would be linking to. That would not be better for the user, and in my example, creating a robots.txt file to track every single index.php variation with a tracking variable would be inconvenient at best.
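The tracking-variable approach described above might look something like this minimal Python sketch. The function name robots_meta_for is hypothetical, the parameter name tracker comes from the example URL in this comment, and the exact tag emitted is an assumption about how a site would choose to mark such pages.

```python
from urllib.parse import urlparse, parse_qs


def robots_meta_for(url, tracking_params=("tracker",)):
    """Return a NOINDEX meta tag for URLs carrying a tracking variable.

    Hypothetical helper: URLs without any of `tracking_params` get an
    empty string, so the canonical page stays indexable.
    """
    query = parse_qs(urlparse(url).query, keep_blank_values=True)
    if any(param in query for param in tracking_params):
        return '<meta name="robots" content="noindex">'
    return ""


print(robots_meta_for("http://foo.com/index.php?tracker=google_adwords"))
# <meta name="robots" content="noindex">
print(repr(robots_meta_for("http://foo.com/index.php")))
# ''
```

With this in the page template, every tracked variant of index.php carries the tag automatically, with no need to list each variation in robots.txt.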
A better solution would be for Google to create a site that does a better job of explaining all of this technical stuff to webmasters. Webmaster central offers great tools, but perhaps a companion site that would offer best practices and explanations beyond the webmaster guidelines would help. Also, if Google notices some important sites that have noindex’ed themselves out of the index, perhaps Google could send them an email notification? Either way, educating webmasters with better information will go farther than changing the current system.
ToddW Said,
February 25, 2008 @ 11:09 am
Don’t show it at all. That’s why I use it.
If I wanted it shown or partially shown I would.
Brent D. Payne Said,
February 25, 2008 @ 11:15 am
Matt,
1. I like that Google is asking for information on this topic. There was a time in Google history that this type of open communication wouldn’t have happened. I feel you and your blog are a key reason Google is reaching out to webmasters. Thank you.
2. NOINDEX means NOINDEX. I have a lot of personal photos (no not those) on the internet and I don’t want people finding them when they Google my name or my son’s name or my family members. It’s meant for family and friends not for the world. I NOINDEX those because Google doesn’t need them, nor any other search engine. It’s my content not Google’s.
3. You need a tag for DUPLICATE CONTENT. I took on the role of SEO Manager at Tribune last week and we have a ton of duplicate content that isn’t going to go away due to business reasons. For example, the L.A. Times writes a story and the Orlando Sentinel may pick it up. Both are Tribune properties and both want the content on their site. I’d like to have a meta tag that tells Google that although the information on Orlando Sentinel is important and should rank higher in, say, Orlando for the article, if it is someone searching in Maine then the L.A. Times is the real source of the information and L.A. Times should get the priority, not Orlando Sentinel. I’d like a ‘trackback’ option for web pages. It would tell Google that here is some good content that may be useful for people that prefer the Orlando Sentinel (browsing history, locale, other sites that they visit that point to Orlando Sentinel, and a myriad of algorithmic possibilities) but for every neutral party . . . send them to L.A. Times because they wrote the article and they should get some extra special treatment for doing so. I’m going to attempt to accomplish this via HTML coding but it would be a better user experience if Google had a tag for ‘SOURCE’ or ‘DUPLICATE CONTENT SEE XYZ’ etc. I’d think it would make Google’s life easier too.
Again, great topic and thanks for asking for the feedback.
Sincerely,
Brent D. Payne
SEO Manager
Tribune Interactive
http://www.tribuneinteractive.com/network
Brent D. Payne Said,
February 25, 2008 @ 11:22 am
One more thought . . .
Perhaps a middle ground could be the following.
A note at the bottom of the SERPs if a site has been omitted due to the NOINDEX tag. You said yourself (to Eric Enge) that a NOINDEX page still collects PageRank and passes PageRank so I’d think a webmaster would want to know that they could’ve ranked #3 for a particular search had they not put a NOINDEX on the page.
So . . .
“Some results have been omitted per the request of the site owner. To view Google’s policies on this matter, please see our Webmaster Guidelines.”
Brent
David Said,
February 25, 2008 @ 11:26 am
Pretty clear that it should be left out of the Google listings
Would also save Google lots of resources… once it hit that meta it just skipped the page.
Katinka Said,
February 25, 2008 @ 11:47 am
I use this for pages that I DO NOT WANT INDEXED, so please keep it that way. For instance, pages whose content I have in two places, where I want the other page indexed. Or as an alternative to a 301-redirect, in cases of clients with servers that don’t give the option of a 301-redirect.
There is no reason to put this tag on anything one does want indexed, so please just respect the webmasters on this one.
David Jacques-Louis Said,
February 25, 2008 @ 11:49 am
It’s one thing I adore - Matt answering the first comments at a newborn post.
I once even wanted to parse the blog and make some stats available on digg.com, so people see and know what this G guy is about.
Asking for tips is a nice way. But nobody has answered user questions on the reinclusion request post for years now. And it’s not the way people should be treated. Because you don’t get an answer at Google Groups either.
If the site was in the Open Directory, then show a reference to the page even if the site used the NOINDEX meta tag - it does make sense to me.
Wesley M. LeFebvre Said,
February 25, 2008 @ 12:25 pm
I believe all pages should be handled the same regardless if they are high profile sites or not.
Tonnie Lubbers Said,
February 25, 2008 @ 12:28 pm
NOINDEX should be exactly that. It says don’t index this page.
Don’t make it more confusing for what it might or might not mean.
Peter (IMC) Said,
February 25, 2008 @ 12:39 pm
Bolding by me.
I don’t agree at all that all traces of a site should be removable by Google or any one else for that matter. A site owner can not dictate what others show on their website. A link in the website http://www.google.com has as much right to be there as a link from http://www.whateversite.com. Why would a website owner have the right to tell anyone not to link to his content?
Google’s only concern should be the relevance of its results. And determining the relevance of a link doesn’t require the page it links to to be indexed. All you need to do is visit the page to determine the relevance of the link. That’s not indexing, that’s just visiting the page and using the content as a base of information to determine the relevance of the link.
This issue is much more fundamental than just what webmasters want. A webmaster has no right to say to anyone to link or not link to their pages and they even more so can’t tell some yes and others no.
Pages online that are not password protected simply are publicly available and you can’t tell anybody not to visit these pages. Even a noindex tag can’t forbid a search engine to visit the page; it’s Google’s choice not to add the page to their index. Which you don’t anyway. But it does not imply you can’t use the content of the page to determine the relevance of a link that points to it.
I don’t see the problem here to be honest. A webmaster would want to forbid Google to link to a page? Makes no sense at all. If it’s that important to them, they should program their server to redirect a visitor that came from Google back to Google or to a page that says: “Sorry, you came from Google, you can’t see this page, why don’t you try to find another site that links to us. In that case you’re welcome.”
Obviously that’s so ridiculous that it is obvious that this whole issue is a non-issue.
Peter (IMC) Said,
February 25, 2008 @ 1:04 pm
By the way, that no trespassing example is wrong. The webpage that has a noindex tag is still publicly available. The land behind the no trespassing sign is not. Makes a huge difference.
tobyism Said,
February 25, 2008 @ 1:13 pm
What’s Google’s policy on spam? No spam, right?
So what if there was a little spam? Or spam with a line drawn through it? Or a reference to spam without the spammy information? No? That’s not how Google handles spam?
Perhaps NOINDEX should be handled similarly.
Alan Perkins Said,
February 25, 2008 @ 1:34 pm
Hi Matt
First off, I have no problem with NOINDEX acting exactly the way it currently does. It ain’t broke, IMO.
Next, you need to be very careful with statements like “Our highest duty has to be to our users, not to an individual webmaster”. That position is just not ethical, IMO. If a webmaster does not want her content to be found by Google’s users, then who is Google to say otherwise?
Not only is that position not ethical, it may not even be legal. Try looking into eBay vs. Bidder’s Edge, in which a judge used an ancient law of trespass to ascertain that Bidder’s Edge robots were trespassing on eBay’s property - and, in that instance, eBay wasn’t even using robots standards. Think of the outcome if a webmaster specifically tells Google to “Go away” and Google ignores the request. Of course, I can imagine why Google may wish to ignore robots standards. But … it can’t! It’s one thing to assume opt-in-by-default - I think that’s an ethical thing to do - but quite another to force an opt in.
Having said all of that, it may be worth you considering a couple of things that are “middle ground”.
1) How closely have you considered the difference between a URL and the content at that URL? Perhaps it would be OK to index a URL, but just not the content at that URL … analogous to the Partially Indexed Pages that arise from current robots.txt treatment. i.e. NOINDEX could literally mean “Don’t index the content”, not “Don’t index the URL”. I think it’s very important that Google doesn’t a) index the content; b) show snippets of the content; or c) provide a cached link to the content. Of course, b and c are impossible assuming the first condition is a given.
2) Maybe option 1 is going a little too far in favour of the user, but you could apply the same thinking purely to the home page of a site. This would alleviate the problems seen by http://www.police.go.kr/main/index.do, http://www.nmc.go.kr and http://www.yonsei.ac.kr, all of which are home pages. (http://www.police.go.kr/ -302-> http://www.police.go.kr/index.jsp -302-> http://www.police.go.kr/main/index.do)
Send my regards to Adam, Luisella and Fili who I met at SES last week. They all seem like nice people. Shame you could not make it, though.
Peter Young Said,
February 25, 2008 @ 1:36 pm
Surely implementation of a noindex tag on a page is something that in general isn’t implemented by default, and therefore must involve a degree of consideration before implementation.
By that rationale, surely allowing the webmaster/design some level of intelligence and credibility behind such decisions would answer your question for you.
Rob Said,
February 25, 2008 @ 1:46 pm
Peter has a valid point. He hit the nail right on the head. Relevancy in search results, that’s the key.
g1smd Said,
February 25, 2008 @ 2:12 pm
*** That would not be better for the user, and in my example, creating a robots.txt file to track every single index.php variation with a tracking variable would be inconvenient at best. ***
Errrr. No. All you need is:
Disallow: *?tracker=
and you’re done.
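A rough way to see what a wildcard rule like this would and would not catch is sketched below in Python. It assumes the wildcard extension in which `*` matches any run of characters and a trailing `$` anchors the end of the URL; that is not part of the original robots.txt convention, and crawlers that don’t support it will treat the rule differently. The function names are made up for illustration, and the `tracker` parameter name comes from Blackbeard’s example URL.

```python
import re

def rule_to_regex(rule):
    """Translate a wildcard Disallow pattern into a compiled regex.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end of the URL, and everything else is a literal prefix.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with ".*" for each wildcard.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_disallowed(path_and_query, rule):
    # re.match anchors at the start of the string, like a prefix rule.
    return rule_to_regex(rule).match(path_and_query) is not None

rule = "*?tracker="
print(is_disallowed("/index.php?tracker=google_adwords", rule))  # True
print(is_disallowed("/index.php", rule))                         # False
print(is_disallowed("/index.php?page=2&tracker=x", rule))        # False
```

The last case shows a limit of the one-line rule: it only matches when the tracking parameter is the first one in the query string, so a second pattern would be needed for URLs where it follows an `&`.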
Alan Perkins Said,
February 25, 2008 @ 2:39 pm
The fact is that sometimes the meta robots tag offers the only means for preventing indexing. Simplest example? Suppose you don’t want the home page (/) to be indexed, but you do want every other page on the site to be indexed. You can’t do that with robots.txt.
Also, on some highly dynamic sites, the meta robots tag is extremely useful … e.g. on a site with three query parameters, for each URL there are 16 combinations of those query parameters. The robots meta tag is invaluable for preventing most of those 16 combinations being indexed, something that would be very difficult if not impossible with robots.txt.
In other words, sometimes robots.txt is just not an option. For this reason, it’s essential that noindex offers at least the same protection from indexing as robots.txt - which means partially indexed pages, at worst.
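One reading that reproduces the figure of 16 is to count every ordered selection of the three parameters, including the URL with no parameters at all (1 + 3 + 6 + 6). The short Python sketch below enumerates them; the parameter strings `a=1`, `b=2`, `c=3` are placeholders, not taken from any real site.

```python
from itertools import permutations

def query_string_variants(params):
    """List every ordered selection of the given query parameters.

    Includes the empty selection, i.e. the bare URL with no query string.
    """
    variants = []
    for size in range(len(params) + 1):
        for combo in permutations(params, size):
            variants.append("&".join(combo))
    return variants

variants = query_string_variants(["a=1", "b=2", "c=3"])
print(len(variants))  # 16
for v in variants:
    print("/page" + ("?" + v if v else ""))
```

Under that reading, a page-level rule that emits NOINDEX for all but one canonical variant covers every row of the list without having to spell each one out in robots.txt.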
Salim Notta Said,
February 25, 2008 @ 3:03 pm
How about adding a new variable in webmaster tools which allows site owners to reference some default content?
For instance, if a site owner applied a no-follow rule to a particular page, they should reference some default page/content, OR if s/he decided to remove the entire site from G, then at least they can reference a link to some other SE where users can locate their site. As Matt C mentioned, G cares about satisfying the user experience, so it shouldn’t concern G which source a website owner wants to direct users to.
Michael VanDeMar Said,
February 25, 2008 @ 3:06 pm
Matt - I tried to get in touch with you earlier on something related to this, did you happen to get that?
Also, quick sidenote… you should seriously consider getting the Chunk Urls for WordPress plugin… g1smd’s comment way breaks your layout in FF.
Domas Mituzas Said,
February 25, 2008 @ 3:09 pm
One thing I’d like noindex (or an additional method) to do is simply not to follow the link at all.
robots.txt is a holdover from the old days (okay, 1994), but it is not a formal standard, and it fails in the modern dynamic world.
Now that we have pretty URLs, without ugly question marks and .cgi extensions, search engines should have a more intelligent method to understand which pages are good to go to, and which are not.
Think of sites that have /calculateMyExpensive report, then add 300 different languages, and end up with many different URLs. Describing them in robots.txt is overhead, forcing them into fixed structures is not flexible, and spiders/search engines still read links, which nowadays are quite often dynamically generated.
So if search engines could agree on robots.txt 15 years ago, maybe it is time to agree on a dynamic way to suggest methods to crawlers?
DTSLW Said,
February 25, 2008 @ 3:23 pm
I’ve encountered a few sites that use things like ‘rel=noindex’ and ‘rel=noindex nofollow’ and I’ve never understood how that works. Nofollow is pretty straight forward, but how would noindex work on an outbound link?
Brian lane Said,
February 25, 2008 @ 4:14 pm
Matt,
I was listening to a commercial on XM about small business, which we are one, and they said something to the extent of “Qualify your customer’s needs, then do all you can to satisfy those needs.” The first thing that came to mind was Google. Kudos for the G.
Now, this may be a sidebar, but I think this falls in line with the discussion of NOINDEX and whether the SEs really adhere to the requests of webmasters.
Over the weekend I found some pages that Yahoo has indexed in a directory that is clearly marked not to follow in the robots.txt file on the site. But they indexed some of those pages anyway.
IMHO, I think that we as webmasters are trying to satisfy the customer’s needs and control what the SERPs deliver up to your customer and ours, because we really do share the customer, if you really think about it.
For us, when we say please don’t index, we have a good reason for it. It would be nice to have some standardization agreed to by all of the major players on how to handle these commands. Much like sitemaps.org came together, why not do something of the same for standards? Wait… isn’t that what the W3C is for?
But what do I know, I’m really just the marketing director.
Let me know when you get back in the KY area and I’ll take you and yours out for some ribs or BBQ. I know you’ve been missing that. TTFN.
DG Said,
February 25, 2008 @ 4:30 pm
If somebody placed “noindex” on the page, that represents a denial of content. That’s self-explanatory.
DG…
Travis Lane Said,
February 25, 2008 @ 4:44 pm
I can’t believe your lawyers are even letting you have this conversation?? If I put a NOINDEX tag on a site and you blatantly ignore it so that you can display the content to your users (justified by you putting their wants over my requests), that opens Google up to a whole host of lawsuits.
You have avoided those lawsuits in the past because the site owners could have put up a robots.txt, noindex, etc. and in most cases chose not to. If you start returning content that has been purposely blocked because you made a business decision to ignore the content owner, YOU ARE HOSED.
Also, I find it interesting that you are willing to break convention/standards because you are afraid of a few high profile sites not appearing. Does this explain why high profile sites can sell links with no fear of de-listing? You can’t very well drop forbes.com and still be considered to have good search results. But the little guy, well they can be squeezed quite easily. Tsk, tsk.
Frederick Gimino Said,
February 25, 2008 @ 4:58 pm
Matt
I am still fairly new to the internet community. So, I can sincerely say that mistakes happen. Unfortunately sometimes mistakes are not mistakes at all but intentional abuse of conventions created by the search engines to assist webmasters which become black hat tools of deception.
I feel that keeping a page indexed even though it has a noindex attribute attached to it in the robots.txt file would not only help newbies (like myself) but also help monitor link schemers who try to use the noindex tag to hide their black hat link schemes.
So, I feel that from both the search engine and white hat webmaster standpoint, indexing or archiving a page in some form would be a win-win situation. The only instance that I could see in which a page would not be considered for indexation is if it had an SSL certificate on it or contained personal information that could potentially facilitate ID theft or other types of internet crime.
Shawn K. Hall Said,
February 25, 2008 @ 5:19 pm
Frankly Matt, I think it’s absurd Google is considering adding links to pages that have NOINDEX on them, regardless of what the ODP, MS, Yahoo or anyone else is doing. I don’t care if there are a “few” webmasters that are doing it wrong by adding NOINDEX to pages that “should” be indexed. The simple matter is, MOST webmasters are using it the way it was intended. Are you really going to punish all the webmasters that care about how and where their content appears because of a few idiots?
And Travis has it right, too. If you do start indexing pages that have NOINDEX on them, and you show ads on those pages with their URLs in the SERPS, it becomes obvious that you’re refusing to comply with the web