Okay, that last post was pretty earnest, so I feel the need to post something really technical now. At SES New York, someone asked “Why don’t you provide a parameter, like ‘?googlebot=nocrawl’ to say ‘Googlebot, don’t index this page’?”
That was a pretty good question. The short answer would be that on pages you don’t want indexed by spiders, you can add this meta tag to the page:
<META NAME="ROBOTS" CONTENT="NOINDEX">
You can read more about the noindex and nofollow meta tags on our webmaster pages.
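To be concrete, here’s a minimal sketch of where that tag goes (the title and body are just placeholders; use “NOINDEX, NOFOLLOW” in the content attribute if you also don’t want the page’s links followed):
<html>
<head>
<title>Some page you don't want indexed</title>
<!-- keep this page out of Google's index -->
<meta name="robots" content="noindex">
</head>
<body>
...
</body>
</html>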
But the user specifically wanted a url parameter. I mentioned that because the parameter “id” is often used for session IDs, Googlebot used to avoid urls with “?id=(let’s say a five digit or larger number)” but that I didn’t know if that was still true. I think someone else nearby asked “Isn’t that kind of an ugly hack though?” and I had to fall back on “You asked for something that worked, not something that was pretty.” The questioner persisted, but I was out of other ways to do it, so I said I’d pass the feedback on, namely “someone wants a url parameter that keeps Googlebot from indexing the page.”
That question came up again today, and I wanted to mention one more way to block Googlebot by using wildcards in robots.txt (Google supports wildcards like ‘*’ in robots.txt). Here’s how:
1. Add the parameter to the urls of pages that you don’t want fetched by Googlebot, like ‘https://www.mattcutts.com/blog/some-random-post.html?googlebot=nocrawl’.
2. Add the following to your robots.txt:
User-agent: Googlebot
Disallow: *googlebot=nocrawl
That’s it. We may see links to the pages with the nocrawl parameter, but we won’t crawl them. At most, we would show the url reference (the uncrawled link), but we wouldn’t ever fetch the page.
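If your pages come out of a script, one way to handle step 1 is a small helper that tacks the parameter onto links to pages you don’t want crawled. A minimal PHP sketch (the function name is made up; adapt it to whatever generates your urls):
<?php
// Append googlebot=nocrawl to a url, keeping any existing query string intact.
function nocrawl_url($url) {
  $separator = (strpos($url, '?') === false) ? '?' : '&';
  return $url . $separator . 'googlebot=nocrawl';
}
// Example: prints /blog/some-random-post.html?googlebot=nocrawl
echo nocrawl_url('/blog/some-random-post.html');
?>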
Obscure note #1: using the ‘googlebot=nocrawl’ technique would not be the preferred method in my mind. Why? Because it might still show ‘googlebot=nocrawl’ urls as uncrawled urls. You might wonder why Google will sometimes return an uncrawled url reference, even if Googlebot was forbidden from crawling that url by a robots.txt file. There’s a pretty good reason for that: back when I started at Google in 2000, several useful websites (eBay, the New York Times, the California DMV) had robots.txt files that forbade any page fetches whatsoever. Now I ask you, what are we supposed to return as a search result when someone does the query [california dmv]? We’d look pretty sad if we didn’t return www.dmv.ca.gov as the first result. But remember: we weren’t allowed to fetch pages from www.dmv.ca.gov at that point. The solution was to show the uncrawled link when we had a high level of confidence that it was the correct link. Sometimes we could even pull a description from the Open Directory Project, so that we could give a lot of info to users even without fetching the page. I’ve fielded questions about Nissan, Metallica, and the Library of Congress where someone believed that Google had crawled a page when in fact it hadn’t; a robots.txt forbade us from crawling, but Google was able to show enough information that someone assumed the page had been crawled. Happily, most major websites (including all the ones I’ve mentioned so far) let Google into more of their pages these days.
That’s why we might show uncrawled urls in response to a query, even if we can’t fetch a url because of robots.txt. So ‘googlebot=nocrawl’ pages might show up as uncrawled. The two preferred ways to have the pages not even show up in Google would be A) to use the “noindex” meta tag that I mentioned above, or B) to use the url removal tool that Google provides. I’ve seen too many people make a mistake with option B and shoot themselves in the foot, so I would recommend just going with the noindex meta tag if you don’t want a page indexed.
Obscure note #2: You might think that the robots.txt that I gave would block a url only if it ends in ‘googlebot=nocrawl’, but in fact Google would match that parameter anywhere in the url. If (for some weird reason) you only wanted to block a url from crawling if ‘googlebot=nocrawl’ was the last thing on the line, you could use the ‘$’ character to signify the end of the line, like this:
User-agent: Googlebot
Disallow: *googlebot=nocrawl$
Using that robots.txt would block the url
https://www.mattcutts.com/blog/somepost.html?googlebot=nocrawl
but not the url
https://www.mattcutts.com/blog/somepost.html?googlebot=nocrawl&option=value.
If you hung on all the way to the end of this post, good for you! You know stuff that most people don’t know about Google now. If you want to try other experiments with robots.txt without any risk at all, use our robots.txt checker built into Sitemaps. It uses the same logic that the real Googlebot uses; that’s how I tested the stuff above.
That’s a pretty sneaky workaround Matt – I like it!
Hey Matt, can you provide further input on the ?id thing?
I’ve built sites before that have stuff like:
mydomain.com?id=aboutus or mydomain.com?id=faq etc… as a templated site system.
Since I used id as the parameter, would google follow these?
One site of an ex-client of mine (I don’t do web design anymore) uses ?pid= in the url to determine which content to pull from the database, and doing a site:www.dpbrownofdetroit.com shows that Google has fetched content for all of their pages EXCEPT the ones with the ?pid= in the URL.
Ryan, to be extra careful, I would avoid “?id=” if you can. It’s used for session IDs so often that the last time they updated the webmaster guidelines (see
http://www.google.com/webmasters/guidelines.html
) I had them add this text as one of the technical guidelines:
“Don’t use “&id=” as a parameter in your URLs, as we don’t include these pages in our index.”
I didn’t think we would do substring matching (so I would have assumed pid= was fine, just as whateverid= would be).
Matt,
I had a client ask about this recently. I uploaded a new robots.txt file and submitted removal requests last week. As of today, they are still showing as pending. Anything that can be done to expedite this process?
Thanks.
Matt,
Hope you don’t mind me butting in on your robots.txt thread, but wondering if you might be able to look at the URL in my signature where I discuss GoogleBowling.
I’ve heard of this approach used by nefarious black hats to knock competing sites out of the SERPs. It didn’t seem right to me initially that a third party should be able to sabotage your rankings, but I now believe it is true. Remember, you added the word “almost” some time ago to your Webmaster Fact/Fiction about this.
As noted on the page, while ideally this doesn’t seem right, I can see it being a pragmatic solution to real-world web spam. I.e. a sudden surge of external backlinks (often with the same anchor text) has the signature of unnatural linking … so algorithmically, it would be difficult/impossible to determine if the webmaster is responsible for these, or someone else.
In the case study presented, the MSN and Yahoo rankings appear to be unaffected – which suggests they don’t employ this next-generation anti-spam technique yet … although I suggest they should, as it is a continual arms race.
BTW, in my case, it may have been natural exuberance of the folks that want to help me out. I don’t have the tools to investigate/determine if black hats took a swipe at me.
Love to hear any insights from you and/or a future post on this topic as I believe of broad interest.
P.S. So do Googlebots follow Asimov’s Three Laws of Robotics?!? 😉
Matt, this reminds me: does the Googlebot support the crawl-delay parameter in robots.txt? I’ve heard differing opinions on this. The Googlebot can hammer pretty hard at times. 😉
I’d also like to hear more about the ?id= thing … The webmaster guidelines have discouraged “id” as a querystring for some time. I have a lot of clients who are still using it, however, and up until recently it hasn’t seemed to prevent them from getting indexed (can’t say the same for rankings, of course). However, it seems that very recently the number of these urls being indexed is declining. This is causing a lot of debate for us with our clients regarding just how important this advice from the guidelines is.
I would certainly appreciate a definitive statement from the Google guru on the issue so I could settle the issue definitively for our company. Thanks!
-bullfrog
Hi Matt,
I don’t want to hijack this thread (yes I do), but you brought up the topic of URL-only showing in the results, and I have a question about it. It came up in a current forum discussion, and, although it’s purely academic, I find it interesting.
When Google finds a linked URL, and the target page hasn’t been crawled, you store the link data (link text and URL) somewhere, so that it can be used for ranking purposes, and sometimes you show the URL-only listing for the target page, because of that link text. In what index is that data stored?
I know that the URL is placed in the list to crawl, but it’s the link text words, and the data that associates the words with the target URL that I’m interested in.
My theory is that it is stored in the short index (“fancy hits” index) straight away, because it needs to end up in there anyway, and it can be used as normal when processing a search query, without the target page needing to be indexed. Somebody else suggested that it is stored in the Supplemental index, and he pointed to some evidence that could indicate it. I disagreed because URL-only listings don’t state “Supplemental result”, and it makes more sense to me if the data is stored in the short index straight away.
Are either of us correct? If not, which index is the data stored in?
sounds like that’s dipping into the secret sauce.. I’ll wager $5 to $1 he doesn’t answer..
Matt –
We set up a “travel reviews” CGI routine and forgot to ban the bot from it, so it followed the link and indexed thousands of pages that show as “empty” reviews.
Question: We’ll ban the bot now using robots.txt, but how important is it to find a way to get all those existing duplicate pages deleted? Will they just fall out now that we’ve corrected the robots.txt or will they persist and hurt us as duplicate content?
** I disagreed because URL-only listings don’t state “Supplemental result” **
Correction: There ARE very many URL-only listings out there that DO state “Supplemental Result”.
So…
What’s the fast and clean way to get some URLs (say 5000) out of Google’s index? I have a few sites with 30k+ URLs where perhaps 5k of those have the robots noindex tag on them (for about a year now), but they’re still showing in Google’s index; about 2k URLs have been returning 404 for about 9 months now. They’re still in the index.
Is getting indexed a one-way path? And please don’t make me manually submit them all to be removed :-).
Similarly — how can we clean up the URLs indexed? Say I have an old forum that has been indexed with the wildest of parameters in the URLs; how can I tell Google to clean those up (without manually 301-ing them all, cloaking the redirect for the SEs)? Wouldn’t I be lining up for duplicate content if I submitted a Sitemap-file for the new URLs?
All I want is to give Google a chance to serve visitors a clean index (for my sites at least) 😀
**Correction: There ARE very many URL-only listings out there that DO state “Supplemental Result”. **
ONLY if they have been previously crawled and indexed. If it’s allowed, they’ll show a cache link as well.
URL only listings where the page has yet to be indexed for the first time do not show the supplemental tag despite evidence to the contrary.
Does Google still follow rel=nofollow links? That seems cleaner than futzing with the URL itself.
In my experience it does still follow those, which is unfortunate and surprising; in some cases these are URLs with side effects (maybe not dangerous, but annoying, like a page that counts reads or votes), or expensive/boring pages to build (like indexes of backlinks). I’d rather not have to use Javascript to keep Google from polling these pages, but it looks like it might be the only way.
Hey Matt,
Thanks for the clarifications. I’ve been trying to figure out the robots.txt exclusion format this week.
In regards to URL Removal… I’ve got one that’s been pending for almost 10 days.
I filed a TT and the response was that the pages would eventually be removed during the natural crawling cycle?
Take care,
-jay
Using the [meta name="robots" content="noindex"] tag on all the pages having the URL variations that you do not want indexed is the best way to get them removed.
If it is a scripted site, then the script just needs to test what parameters are in the URL that called it, and act accordingly. So, print-friendly and other such duplicate pages are easily de-listed.
I may lose, but I’ll take your wager, Ryan 🙂
Thanks, Ian. I’ve only noticed the not-apparently-supplemental kind, but I haven’t done an extensive study, as it’s just out of academic interest, and doesn’t have any effect on anything from our side of things.
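As a rough sketch of what PhilC describes above: the script looks at the parameters it was called with and emits the noindex tag for print-friendly or other duplicate views. PHP here is purely an illustration, and the ‘print’ parameter name is an assumption:
<?php
// Treat print-friendly requests as duplicate views that shouldn't be indexed.
// The parameter name is just an example; use whatever your script actually takes.
$noindex = isset($_GET['print']);
?>
<html>
<head>
<title>Example article</title>
<?php if ($noindex): ?>
<meta name="robots" content="noindex">
<?php endif; ?>
</head>
<body>
...
</body>
</html>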
Phil, I thought you took no post to the contrary from Matt as meaning you were correct 🙂
What would happen in the following situation?
John Doe uses a robots.txt on his domain “johndoe.com” with the “Disallow: *googlebot=nocrawl” wildcard.
Mary Doe places a link on her site “marydoe.com” to John’s site with the URL:
http://www.johndoe.com/anywebpage.html?googlebot=nocrawl
What would Googlebot do in this situation when following the link? As John Doe has that wildcard in his robots.txt, would Googlebot index the page?
This might be a bunk question, but I am forced to ask it 😉
So, given your examples of when URL-only listings are useful, what about adding another variable to the equation? I.e., if a page is explicitly blocked and nobody links to that page from a different site, then can’t you trust the content author that the content is of poor quality and/or that it should not be indexed or listed for one reason or another?
If people cite a page from another site then perhaps it is worth it to still allow that to be URL only or a DMOZ modified listing, but if a page has no external citations why show it for searches like site: searches?
Many content management systems like MovableType 3.12 (I think that was the number) produce nuclear waste comment redirect pages that I believe add little to no value to any search index. To put that in perspective, on one of my sites I have about 1,500 pages of content and 17,000 pages in your index.
No Dave. No reply from Matt simply means that we don’t know the true answer.
LOL! That’s not what you have said before 🙂
Matt,
Could you please clarify this:
“Ryan, to be extra careful, I would avoid “?id=” if you can. It’s used for session IDs so often that the last time they updated the webmaster guidelines (see
http://www.google.com/webmasters/guidelines.html
) I had them add this text as one of the technical guidelines:
“Don’t use “&id=” as a parameter in your URLs, as we don’t include these pages in our index.”
Clarification #1:
The reference in the Google Webmaster guidelines says
‘ampersand id equals’ &id= not
‘questionmark id equals’ ?id=
but you seem to be referring to them interchangeably?
Should the Webmaster guidelines be changed to say “Don’t use “&id=” or “?id=” as a parameter in your URLs, as we don’t include these pages in our index.”
Clarification #2:
I haven’t ever understood under what circumstances pages with urls which include &id= are actually excluded from the index?
Clearly – pages with URLS which include the parameter “&id=” ARE included in the index. Look at some of the URLs returned from these searches:
http://www.google.com/search?num=100&hl=en&lr=&as_qdr=all&q=allinurl%3A+id+%26+%3D
http://www.google.com/search?num=100&hl=en&lr=&q=site%3Aaddons.mozilla.org+%26id%3D
http://www.google.com/search?num=100&hl=en&lr=&q=mambo+%26id%3D&btnG=Search
So could you please clarify under what circumstances pages whose URL string includes “&id=” are excluded from the index? More than two occurrences of “&id=” in one URL? More than three occurrences? Or some other criteria?
Either way, the guideline is currently misleading, as some pages which include “&id=” in their URL are in fact included in the index.
You should know by now that Matt doesn’t answer these special sauce questions Mr. Wall. I can’t post comments in your blog anymore and the email I sent you seems to have disappeared into a black hole. I was also having trouble posting on TW but I figured a way around it. You still friends with Nick Wilson?
I have a client that wants a part of their website to be private. So there will be a login procedure and all that for people to get in, and I will do the normal stuff (robots.txt and rel=nofollow) to keep Google and other search engines out. BUT if the URL shows up in the search engines, hackers would probably be able to get in anyway, wouldn’t they? So I really would prefer for Google to just not index even the URLs.
I hope google will consider this.
Hi Matt,
First, please excuse my English, I’m French.
Secondly, about the parameter ‘?googlebot=nocrawl’: okay, I can use it to ‘noindex’ pages, but have you thought about this solution when you have to implement it in a CMS? It will take time, and I think that’s why lots of webmasters won’t use it.
Another thing: if I implement this solution on my website, I would have to cloak pages (with this parameter) to keep clean URLs, first for visitors, then for the other search engines (sorry :p). This alternative puts me off, because I always think about the accessibility of my website, and that’s the second reason not to use this solution.
Why not simply use the W3C recommendations (robots.txt and meta robots) instead of creating dedicated solutions?
You are mistaken, Dave. I made statements the last time, and Matt could have corrected them if he’d wanted to. This time I’m not making a statement – I’m asking a question. Simple, huh?
Matt,
thanks for this article. Could an error or a problem with &id= or the robots.txt be the reason for the following mysterious symptom?
Since July 2005, Googlebot 2.1 has been spidering our site (about 4,000,000 pages in BD-DC) at only 1,000 hits/day. Before July 2005 the bot was spidering about 10,000-20,000 pages/day. This “only 1,000 hits/day” problem is still going on today.
The MozillaBot does not have this symptom.
Greetings from snowy Germany,
Markus
Matt Cutts:
I don’t think I will post here any more.
But please check my Google Story:
http://www.pennyrank.com/story-of-google.jpg (Google’s Market Share %)
And if possible forward to Larry and Sergey.
It’s getting a little old hearing those who lost their webmaster welfare checks for having low quality sites taking weak snipes at Google. I will lose a few more friends saying this, but it needs to be said. It’s like biting the hand that fed you. Do you try to destroy your dad after he raises you and tells you that you will now have to fend for yourself? Grow a set and go out there and make something useful!
Am I out of line?
That’s really funny. “In 2003 Microsoft staff said they don’t see Firefox as a threat.” They said that in 2003??? I don’t need to explain it, do I, Rahul? 😉
Rahul. If you’re suffering because of Google, your business model is totally wrong. In other words, it’s entirely your own fault.
Oops PhilC, it’s 2004:
http://www.zdnet.com.au/news/0,39023165,39166227,00.htm
“Microsoft: Firefox does not threaten IE’s market share”.
I agree with them.
Keith, let me check on this. Were the results supplemental or regular? Were the removal requests via the url removal tool or by email? For regular results, you would normally just have to wait a few days for us to pick up the new robots.txt file.
Dave, Googlebot doesn’t support the Crawl-Delay suggestion in robots.txt. I intend to do a post about why not at some point. If you’re impatient, you can listen to the MP3 of pundits of search from the SES NYC show on webmasterradio.fm. I talked about why we don’t support crawl-delay there. I would like our crawl team to support some way of reporting how much to throttle Googlebot though.
Bullfrog, I personally would not use urls with “?id=” in them. I do think it’s worth the effort to switch if you’re doing so. I’ve talked to 1-2 people who weren’t crawled the way that they wanted, and this was the cause.
PhilC, in general for the regular crawl, references to uncrawled urls are placed in the regular index, not the supplemental index. So you’re correct on that point. Did anyone bet with Ryan? 🙂
Hmm. A more correct way to put it would be that there is a regular Googlebot and a supplemental Googlebot (though their user agents will be the same), and uncrawled urls from the regular Googlebot will go in the regular index while uncrawled urls from the supplemental Googlebot will go in the supplemental index. Hope that makes sense; I believe that’s correct.
JohnMu, I’d hold on. Having the older urls shouldn’t cause any problems. If we refresh the supplemental index anytime soon, that would help too.
Joseph Hunkins, I would just put the robots.txt up and let them fall out naturally.
Aaron Wall, I take your point. But consider the case of namebase.org. That’s the site that D.B. wanted Google to crawl more deeply. Suppose there’s a page A on namebase.org that doesn’t have any off-domain links, only internal links. I think the site owner of that domain would still want us to return the url reference if we thought it was relevant. Just to be clear, I’m saying that most webmasters probably want as much representation in Google’s index as they can get. But you raise a fine point: if something is blocked in robots.txt and there’s no external links to it, that’s a case where we could consider not showing the url reference. I’ll have to think some more about that case. My guess is that cases like that aren’t enormously common; do you agree?
Hey Matt,
Off-topic post (sorry, but it just occurred to me last night):
Is there any way to use the toolbar to specify a datacenter by IP address? I can’t see any, and I was hoping to put one of the BigDaddy datacenters in there.
If not, that might be a good thing for a future toolbar release.
By the way, Aaron Pratt is right: most of the bitching, whining, moaning, snivelling and complaining comes from those who think they got cheated, not from those with the best interests of the engine and its end users at heart.
Matt,
Is there a way for webmasters to have Google delete non-existent web pages? The url removal tool only serves as a method of not showing the results in the search, but the pages are still in Google’s database and have a tendency to go straight to the supplemental index once the removal request expires.
I get the impression that Google just doesn’t want to let go of pages and archives them irrespective of webmasters’ wishes, even if they don’t show for any search results.
I have seen a site that clearly has noindex in the meta tag and if you search on Google for the site it comes back with no results found but Google still has a cache of the page complete with the noindex meta tag.
Does the noindex tag mean that you will still crawl and cache the page just not show it in the results?
Going a bit OT but does anyone know if having 2 pages the same but one of them having &offset=0 at the end of it can cause duplicate content problems?
In my directory, after going to the 2nd page of a category, the link back to the first page of the category actually goes back to “page-name&offset=0”.
If this is likely to cause a problem, is there anything I can put in the robots.txt to help?
TIA.
T.J.: what language is that in? Because if it’s ASP, then I could probably give you a code snippet that would eliminate the need for the offset=0 part of the link.
Phil, are you this much fun at parties? chill man, it was a joke, hence the smiley 🙂
Valentine – Maybe they hold the database for a while on deleted pages as a kind of a criminal record. Just as Matt stated, set a 404 for deleted pages and don’t worry about it; works for me.
Adam – For fun, when someone bitches in here and posts a URL, use the WayBackMachine to check out their site, which is kind of what Google might do with the “database” Valentine speaks of.
There is a lot of love in this post now can you feel it? 😉
@Matt
Do you have new information about the Supplemental Hell problem?
It would be also very helpful if you could answer the following questions:
1. Is the Supplemental Hell in every case a bug that will be solved? Or can it be that this is intended on a long-term basis for some pages?
2. When can we expect a complete solution to this problem?
Thanks Adam,
Yes it is ASP, and yes a snippet of code would be appreciated :o)
I still don’t know if the way I have it could cause duplicate content problems, but I would rather not risk it until I hear different.
This is the code I am using at the moment to go back to first page in category, from all other pages in the same category.
[The ASP snippet was stripped by the comment form; it checked whether the offset was greater than 0 and, if so, printed a “First” link back to the start of the category.]
Thanks again, and thanks Matt for using your blog as a help page :o)
Sorry, code obviously didn’t post.
Feel free to email me Adam
addy is on site in my sig
(contact page is linked to right at bottom of page)
Many thanks for the explanation, Matt. And, yes, *I* took Ryan’s bet. Ryan: you owe me $5 🙂
Dave – my apologies. It’s sometimes difficult to read what a person really means, even with smilies, and I assumed that you were taking a shot – my mistake.
I think you missed a little bit of Aaron’s point, Matt. He said that, “if a page is explicitly blocked and nobody links to that page from a different site …”. In this case, the owner doesn’t want the page to be listed in the serps – not even as URL-only.
I would go even further and suggest that, when a page is explicitly blocked by the site owner, then don’t list its URL in the serps, regardless of whether or not other sites link to it.
There are many reasons why a webmaster may want to block access to their site or pages. If they do then shouldn’t Google respect that?
With the url removal tool it is only temporary. These pages tend to come back in, beyond your control. Yes, you could 404 them or add noindex meta tags to them, but they could stay in there for ages before Google finally crawls the pages again and takes them out of the index.
It serves users no purpose to be presented with pages of results of non-existent web pages, or out-of-date pages that webmasters have specifically said not to include.
Google bans sites on a regular basis and generally you would need to file a reinclusion request to get back in.
Shouldn’t it work both ways? If you ban Google from pages or sites shouldn’t they ask before they reinclude them? It would be as simple as checking the robots.txt of the site or the pages to see if they still exist and don’t include noindex meta tags.
I think Matt’s excuse as used in his example is pretty lame, “We’d look pretty sad if we didn’t return http://www.dmv.ca.gov as the first result. But remember: we weren’t allowed to fetch pages from http://www.dmv.ca.gov at that point. The solution was to show the uncrawled link when we had a high level of confidence that it was the correct link.”
When a website says “NO” what gives Google the right to change that to “Maybe”? If a webmaster says “No”, it should mean “No”. No url only, no description from Dmoz, nothing. If Google feels so strongly that they need to return the results then maybe they should contact the webmasters directly to seek permission.
And I’m still puzzled why Google would cache a page that says noindex in meta tags?
Well bad content management systems are common, and when people try to fix the errors caused by them I don’t think cloaking or learning Perl should be a first port of call 😉
RE: “Dave – my apologies. It’s sometimes difficult to read what a person really means, even with smilies, and I assumed that you were taking a shot – my mistake. ”
No problems Phil. I guess I was ‘taking a shot’, but purely in jest.
TJ, here you go:
I figured I’d post it here in case someone else was in your boat. Simply put, IntConverter is a bug-free version of the CInt/CLng functions, since it does a check for non-numeric and null terms.
So, what you can do if you want with your directory link is go right to page-name. Offset will be 0 by default.
Hope that helps ya, buddy. If not, I at least hope it helps anyone who has gotten extremely pissed off by the error that CLng/CInt can throw with nulls and non-numeric terms.
Since I was stupid and didn’t think to ask if this was okay:
Matt, are you okay with us posting functions like I just did to try and help other people?
Matt likely is, not Lost Puppy and Kirby though 😉
hi matt,
we have a lot of our news pages indexed by Google (news.php?id=123). After your initial post, we changed it to news.php?news=123. The pages are still reachable via news.php?id=123. We’ve updated our sitemap.xml to news.php?news=123, also.
– will this be punished by Google (duplicate content)?
– how long will it take for the old ‘id’ pages to fall out of the index (if at all)?
thx
Hi,
I am wondering why things are made this complicated. If one does not want to have a page crawled, he can use standard robots.txt syntax for that.
If this is difficult, due to the names chosen for the pages, and if he follows your suggestion, he should be aware that
‘http://www.mattcutts.com/blog/some-random-post.html?googlebot=nocrawl’ will not be crawled, but that ‘http://www.mattcutts.com/blog/some-random-post.html?’ will still be.
To avoid that, he will have to play with a .htaccess file (or a similar technique) to make sure ‘http://www.mattcutts.com/blog/some-random-post.html?’ is redirected to ‘http://www.mattcutts.com/blog/some-random-post.html?googlebot=nocrawl’.
Not a very nice solution, in my opinion.
Jean-Luc
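For what it’s worth, the .htaccess gymnastics Jean-Luc describes might look roughly like this with mod_rewrite (path and parameter taken from the post’s example; this is a sketch of the idea, not a recommendation):
RewriteEngine On
# If the query string doesn't already carry googlebot=nocrawl,
# redirect to the same url with the parameter appended (QSA keeps the rest).
RewriteCond %{QUERY_STRING} !(^|&)googlebot=nocrawl(&|$)
RewriteRule ^blog/some-random-post\.html$ /blog/some-random-post.html?googlebot=nocrawl [QSA,R=301,L]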
I’d like to strongly add my support for Aaron and Valentine on this matter. I really can’t fathom the idea that you would ignore an explicit wish by any webmaster to not index a page (or for that matter to not follow a link).
You said…
“Just to be clear, I’m saying that most webmasters probably want as much representation in Google’s index as they can get”
That may well be the case, but just to be clear, if I’ve applied noindex to a page, disallowed a page or directory, or added nofollow to a link, I don’t want ANY representation in the index for that page or directory, and I don’t want the link to be followed. How is there any room for quibbling here? It’s an explicit command.
I mean, it’s not like people routinely create robots.txt files or add meta tags by accident 🙂
I would strongly suggest an alternative command – one that makes more sense, semantically – for allowing the indexing of URL’s only. And I’d bet that it would hardly ever be used.
Damn, you talk too much Adam, but please do not refrain from posting some code; us newbies have fun learning about this stuff from those who have been there before. Thanks.
I have a question on querystrings and tracking referrals for marketing purposes, and how Google indexes these. I’ve tried getting an answer from Google’s help email, but I’m still a bit unsure what to do.
Basically I have a number of partner websites that link to us using a tracking url, i.e. a querystring.
e.g. http://www.quinn-direct.com?advertocode=GOOGLE3
On some partner sites where we have a fixed position, Google is indexing this link, and the unique querystring is being read by Google on the partner site; however, this tracking url is now coming up on Google in our natural listing as the link to our website within search results.
Any idea how this can be fixed? Is it a case of robots.txt being amended on our site or on the partner site to exclude ?advertcode=?
The result is that people clicking in from our natural listing on Google are being tracked as if they’re referrals from the partner site.
help!
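If the robots.txt route were the answer here, applying the wildcard technique from the post on quinn-direct.com would look something like this (parameter name spelled as in the question above; whether blocking the crawl is the right fix for the listing itself is a separate question):
User-agent: Googlebot
Disallow: *advertcode=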
Not sure what code you are referring to Aaron. If it’s the last thing I mentioned, then that doesn’t exist 🙂 I was merely suggesting that perhaps it should be created (in much the same way as Google ushered in rel="nofollow") in order to cater for those mysterious webmasters who apparently exist that want their URL indexed and listed in the SERPs but not their content.
Something like [meta name="robots" content="urlonly"] would serve the purpose admirably, allowing everyone else to rest safe in the knowledge that [meta name="robots" content="noindex"] (and Disallow in robots.txt) work as their names suggest: ie, the page doesn’t get indexed *at all*.
Hi everyone,
If I have two domain names, “domain.com” and “domain.es”, and I only want domain.es to be indexed to avoid SPAM (domain.com is used only for people who mistype the name):
Is there any way to arrange that? I can’t use robots.txt or the NOINDEX meta tag because the two domains are in the same root in IIS.
I think I can clear up the confusion.
There are two Adams. Me, and that other guy who I don’t know and is obviously an inferior carbon copy a la Michael Keaton in Multiplicity. 😉
Seriously, other Adam, you might want to put your last name to distinguish (even though I already did).
Consejos SEO: there is a way in IIS to redirect one site to another using a 301 redirect. Set up your domain.es site the way you normally would, but don’t add in your domain.com host headers.
From there, set up domain.com (and the host header www.domain.com) as your second site, and instead of pointing it to your domain.es directory, have it redirect to domain.es and check the box that says “Permanent redirection for this resource.”
Matt — Slightly off topic, but could you give us some hints on how Google would prefer that we set up A/B style tests? If I’m randomly serving up different content on the same URL, or randomly redirecting some visitors to a particular page it could conceivably appear to be cloaking, when really it is just an A/B test. Is it best practice to put a NOINDEX on the test content?
Thanks for any insight you can provide.
How do Google, or you for that matter, feel about dynamic urls? Specifically geo-location content. Anything I should know about or avoid with the search engine?
Thanks,
Diana
“Keith, let me check on this. Were the results supplemental or regular? Were the removal requests via the url removal tool or by email? For regular results, you would normally just have to wait a few days for us to pick up the new robots.txt file.”
I used the URL Removal tool at http://services.google.com:8882/urlconsole/controller?cmd=reload&lastcmd=requestStatus&cmd=requestStatus
and it lists everything as pending from 3/9/06. Would an IP change affect this? Also these were listed as regular results. Thanks, Matt.
Thanks again Adam (Senour)
I hate to say it, but that is far too complex for me 🙁
HTML, CSS and very basic ASP is about my limit.
I actually got over the problem the easy way.
I left a link up to the next page, but removed the link to the previous page and just replaced it with, “Use your back button to return to previous page.”
It’s a shame, though, that the fear of getting penalised by Google makes us have to do these things.
Hi Matt,
I am new to SEO and have been using sitemaps as a guideline for my website, but I just can’t seem to get any pages that refer to me (allinurl:www.—-.com). I have had the site for over 1 1/2 years and still nothing. Is there a reason why nothing comes up? Thanks for your help.
Ryo
Hmmm…
I guess I could try and explain that function a little more simply.
Basically, you have two cases:
1) A number (2, 1, 5, 9). If it’s a number, then the function will return that number converted to an integer.
2) A non-number (nothing at all, “BOB”, “I ROCK”). You get 0 in those cases.
So…if you don’t pass an offset querystring in your example above, you’ll still get 0 back from your page.
I can’t explain it much more simply than that, I don’t think. Maybe another programming geek would like to give it a shot.
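For anyone following along, a rough PHP equivalent of the behaviour Adam describes (the original function was ASP and didn’t survive the comment form; the name here is made up):
<?php
// Return the value as an integer if it's numeric, otherwise 0.
function int_or_zero($value) {
  return is_numeric($value) ? (int) $value : 0;
}
// int_or_zero('2') is 2, int_or_zero('BOB') is 0, int_or_zero(null) is 0.
?>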
Matt,
Correct me if I am wrong, but here is one site, http://www.ezinearticles.com, that uses &id= …
It’s ranking well in Google SERPs, so I guess it would be safe for me to say this is not true at present?
Matt,
Google supports wildcards like ‘*’ in robots.txt, but the Google “remove your url” system doesn’t. I get this:
> URLs cannot have wild cards in them (e.g. “*”). The following line contains a wild card: DISALLOW /*PHPSESSID
Why? It doesn’t help a lot to use wildcards in robots.txt if Google doesn’t accept fast url removal.
Adam Senour, thanks for your reference, but on some sites I find two options.
Which of these is better and doesn’t abuse the SEO advice?
Configuring the Apache server:
ServerName www.domain.com
Redirect permanent / http://www.domain.es/
OR using this content for index.php on domain.com:
header("HTTP/1.1 301 Moved Permanently");
header("Status: 301 Moved Permanently");
header("Location: http://www.domain.es");
Best regards.
What if Googlebot isn’t re-crawling pages for updating due to the long session id tag in the parameters? Is there any way to get these kinds of pages updated that have already been crawled with the session id?
**Hmm. A more correct way to put it would be that there is a regular Googlebot and a supplemental Googlebot (though their user agents will be the same), and uncrawled urls from the regular Googlebot will go in the regular index while uncrawled urls from the supplemental Googlebot will go in the supplemental index. Hope that makes sense; I believe that’s correct.**
Matt,
Can you, or anyone else, show me an example of an uncrawled URL from the supplemental index that shows up in the search results?
Thanx 🙂
Servus Matt,
I read your article twice, but I guess my English is just not good enough to understand. Anyway, maybe someone has a solution for my problem:
I’m the admin of a German karting website, just for fun and private, but I think it is quite good (around 120 unique visitors per day). Now that the site has been up for 1.5 years, it is time to look at what could be improved. Well, there is for sure a bunch of stuff to do, and that is how I found your blog.
Q1:
in my robots.txt I have:
User-agent: *
Disallow: /administrator/
So why do I see Googlebot checking a page that you can only reach after logging into the admin backend? O_o
http://www.domain.de/administrator/index2.php?option=com_content&…….
Q2:
The site has the feature to print articles (usually about races and stuff like that) as a PDF. Well, it seems like Googlebot really likes that feature – it indexes the PDF files rather than the actual page, which doesn’t help the user because the PDF engine doesn’t support pictures.
Here is a link as it could be:
http://www.domain.de/index2.php?option=com_content&do_pdf=1&id=80
So how do I tell Googlebot not to crawl these links? As the site is based on the Joomla/Mambo CMS, it is not easy to make changes to the system itself, so I would prefer a workaround. By the way, what would you guys guess – will Google then index any of the site at all?
Maybe I should leave it at that, but now that I have started, here is another question – which url is the better one? 😉
Q3:
http://www.domain.de/index.php?option=com_content&task=view&id=76&Itemid=1
or
http://www.domain.de/content/view/76/1/
and what about this one?
http://www.domain.de/component/option,com_docman/task,cat_view/gid,80/
So thanks to anyone,
Sebastian
just in case someone wants to mail directly: #azreael#ät#web#pünkt#de#
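On Sebastian’s Q2, the wildcard technique from the post suggests something along these lines (the do_pdf parameter name is taken from his example url; treat it as a sketch to try in the Sitemaps robots.txt checker first):
User-agent: Googlebot
Disallow: *do_pdf=1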
Matt – regarding the first couple sentences of your post – it looks as though Google is in some cases indexing pages with “id” as a parameter – see the first result (teleflora) on this search:
http://www.google.com/search?q=dozen+premium+red+roses
The resulting page is http://www.teleflora.com/product.asp?id=34529 Does Google recognize that with Teleflora “id” is not a session ID but rather their product SKU? Does Google Sitemaps help if this is a concern one might have?
Matt,
The removal finally went through. Thanks for looking into it.
Could collecting info by crawling url’s/content that are forbidden by the webmaster be construed as theft or invasion of privacy? If not, why not?
Can anyone provide me with the basic logic with which Google starts crawling? This is not the logic of crawling a particular website, but the logic with which Google finds a newly introduced website (with a new IP and domain name).
Urgently needed
Regards –
Hi Googlebot search logic.
I still don’t know the logic Google uses to find a brand-new website. But I know some tactics to get them to find you.
You can post your URL on some famous forums that Googlebot frequently visits. Then Googlebot will see the link to your site there and start to crawl you.
Hai 🙂
Awesome! I love the ?googlecrawl=no idea!
robots.txt examples for phpBB and WordPress
Hi Matt, what is your stand on this issue
http://www.youtube.com/watch?v=nLB1-Kc4CWE
All Blogger blogs using custom domains are not being indexed and cached after 30 July. No Google employee is responding to this issue in the webmaster help group. Where else can we hope for an answer?
Please take a look at this issue.
Thanks,
Sunil
Matt,
can “?googlecrawl=no” be implemented as an alternative to the “nofollow” attribute? If not, why is that?
Thanks,
John