Search results in search results

I was reading an interesting question on Google’s webmaster help group that was posted a few weeks ago. The question was

Is there any official Google statement regarding that search result on
one’s own site ought to be disallowed from indexing (e.g. via
robots.txt)?

and the questioner went on to mention that YouTube’s search results were showing up in Google. Vanessa Fox showed up to tackle the answer:

Typically, web search results don’t add value to users, and since our
core goal is to provide the best search results possible, we generally
exclude search results from our web search index. (Not all URLs that
contain things like “/results” or “/search” are search results, of
course.)

I’ll take a look at the YouTube example. Thanks.

As a result of that question, YouTube added a “Disallow: /results” line in its robots.txt file. That’s good because as Google recrawls web pages, we’ll see that and begin to drop those search results.
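
In robots.txt terms, that one addition looks something like this (a minimal sketch – YouTube’s actual file contains other directives as well, and this assumes the rule is meant to apply to every crawler):

User-agent: *
Disallow: /results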

Google already does similar things with our own web search results, Froogle, etc., to try to prevent them from causing problems for any other engine’s index. In general, we’ve seen that users usually don’t want to see search results (or copies of websites via proxies) in their search results. Proxied copies of websites and search results that don’t add much value already fall under our quality guidelines (e.g. “Don’t create multiple pages, subdomains, or domains with substantially duplicate content.” and “Avoid “doorway” pages created just for search engines, or other “cookie cutter” approaches…”), so Google does take action to reduce the impact of those pages in our index.

But just to close the loop on the original question on that thread and clarify that Google reserves the right to reduce the impact of search results and proxied copies of web sites on users, Vanessa also had someone add a line to the quality guidelines page. The new webmaster guideline that you’ll see on that page says “Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines.”

This hasn’t been a burning issue for many people, and for people who pay attention to search I’m sure it’s a well-known fact (e.g. see here where someone asked me about a particular site copied via a proxy, and my reply later that day), but it’s still good to clarify that Google does reserve the right to take action to reduce search results (and proxied copies of websites) in our own search results.

Philipp, thanks for asking the question originally. It was good that you pointed out we had some of our own web search results showing (so that we could correct that), and it’s also good to make sure that site owners get clear guidance.

58 Responses to Search results in search results

  1. Google surely reacts to everything “Philipp” says.

    Just an observation.

  2. One proxy that I’ve seen popping up in Google’s search results is zta.net – just search for [site:zta.net] to see how many sites they’re copying. Since they seem to be hosting (live?) duplicate content on subdomains, I was worried they could rank higher than official sites and perhaps get some sites penalized for hosting duplicate content at multiple domains.

    Should those pages be included in Google’s search results?

  3. That’s a good move to ask webmasters to disallow search URLs, but I guess many of them using a CMS don’t even know about this issue: http://www.google.com/search?q=inurl:/index.php/?q

    But what about webmasters creating hundreds of thousands of ‘fake’ search pages on which you find only AdSense and links created on the fly to the same search on other sites of theirs?

    Does Google plan to clean those pages from the index too? That would be great, because spam reports don’t seem to be very effective on this issue.

  4. Here is an example of how it’s done, so that everyone can implement it right away:

    Step 1: Edit your robots.txt (if you don’t have one, create it)
    Step 2: The file must say:

    User-agent: *
    Disallow: /directory-not-for-robots-or-spiders

    Step 3: Save the file and upload it to your server.

    Replace “directory-not-for-robots-or-spiders” with whatever directory you need to block.
    ———-
    Say you want to prohibit only Google but allow the rest; then replace the * with Googlebot:

    User-agent: Googlebot
    Disallow: /directory-not-for-robots-or-spiders
    ———-
    Say you want to prohibit multiple directories:

    User-agent: *
    Disallow: /directory1
    Disallow: /directory2/subdirectory/
    Disallow: /directory3
    ———-
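    Googlebot also understands * wildcards in these patterns (a Google extension – the basic robots.txt rules match only path prefixes, so other robots may ignore it). Say you want to block every URL containing a q= query parameter:

    User-agent: Googlebot
    Disallow: /*?q=
    ———-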
    And if you want to test that everything went smoothly, log in to Google Webmaster Tools; under the Diagnostics tab there’s a robots.txt tester.

    Happy surfing!

    Luis Alberto

  5. Hmmm Matt, so my next question would be: if we know of certain sites using this technique to inflate their page numbers, do we report them as spam? How do we go about this?

    Let me give a common example: Technorati. They have search results (254,000 pages) and tag results (964,000 pages). Both are indexed in Google and often come up in results.

    Thanks!

    Luis Alberto

  6. Search results are an asset to the SERPs, and should be included. They potentially offer the same impact as the ‘Search Suggestions’ now at the bottom of Google’s organic SERPs.

    Clicking on RELEVANT, RELATED search results can open an entire new window.

    Perhaps an algo combination that decides, based on TRUSTRANK and click popularity, which search results should be allowed to rank higher in the organic SERPs should be considered.

    But there is potential with this resource and they should NOT be automatically excluded.

    All Information has some validity! 😐

  7. as for Tony’s question:

    “One proxy that I’ve seen popping up in Google’s search results is zta.net – just search for [site:zta.net] to see how many sites they’re copying. Since they seem to be hosting (live?) duplicate content on subdomains, I was worried they could rank higher than official sites and perhaps get some sites penalized for hosting duplicate content at multiple domains.

    Should those pages be included in Google’s search results? ”

    _______

    I’ve clicked on some of the copies of these sites (URLs appended with zta.net); most of the time I ended up at zfs.org, which seems to be associated with zta.net.
    Should this zta.net as well as zfs.org be investigated?

    thanks

  8. Aaron Pratt, Philipp often asks good questions, but I don’t respond to everything he says. 🙂

    Tony Ruscoe and meng xiang, I’ll ask someone to look into it.

    Sergi, bear in mind Vanessa’s point that “q=” or “/search” doesn’t mean that something is necessarily a search results page. As for your other point, I’m always happy to get spam reports about ‘fake’ search engines. That stuff is right down my webspam alley. 🙂

  9. thanks Matt :D.
    I’ve been reading your blog for a while, it’s really a good place to learn about anti-spam and search quality and everything.

  10. Stupid side question in all of this: what about approaching the issue of fake search engine/auto-generated page spam from a slightly different angle and trying to auto-detect/ban them from AdSense and possibly even AdWords? A lot of these guys are making money off one or both of those programs – money that could justify their efforts even if they get banned from the organic SERPs (which they quite often don’t).

    Maybe by cutting off the money flow, the spam doesn’t flow as freely into the organic SERPs. (I’m not sure about this, because a lot of them would probably turn to Yahoo! or AdBrite or something like that, unless you guys can get together on issues like this.) It wouldn’t be worth it for at least some spammers if there were reduced or no revenues to collect (in theory).

    Anyway, just a random thought.

  11. Matt:

    There are a lot of little initiatives I can envision that would help Google index content better. I am planning to slowly “propose” them on my Well Designed URLs blog [1]. However, I’d like to present them to people at Google and cultivate an open discussion about them, because without involvement from you and others in the community they will be nothing more than metaphorical hot air.

    Is there any kind of forum for engaging people at Google who have the power to recognize new webmaster techniques in indexing, in order to help Google and make a better web?

    [1] http://blog.welldesignedurls.org

  12. More on topic, how does Google view web pages like the following? (this is a page I wrote for my _former_ company; they have since updated it just slightly):

    http://www.xtras.net/ComponentsAndToolsForDotNet.asp

  13. Is this not a little too vague?
    “Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines.”
    Would this not be better:
    “Use robots.txt to prevent crawling of all search results pages if any pages link directly to a specific results page. Also use robots.txt to prevent crawling of auto-generated pages that don’t add much value for users coming from search engines.”

  14. I think that’s spam and has to be removed from Google’s index…

  15. Will this update filter out websites that just point to a URL that has their own Overture-powered feed?

  16. Cool info – thanks Matt & Vanessa.

    > Google surely reacts to everything “Philipp” says.

    Note I often just “forward” questions that pop up in the Google Blogoscoped forums or elsewhere…

  17. While I understand your guideline for general web search, in local that seems a little short-sighted (and probably anywhere category exploration is critical – so I’m not sure killing YouTube’s search/browse pages was a good idea either).

    If I search for “tailor san francisco” I probably want a listing of tailors. Google’s OneBox shows a few results from Local, but then shows a single tailor from Yelp (albeit a decent one) below. Since the user is effectively doing a category search/browse, isn’t another list of tailors what the user wants? By listing the CitySearch “best of” tailors next, you’re effectively doing this. Yelp has essentially the same page, except that instead of editorial content it’s user-generated with search – but it gets at the same thing, a list of excellent tailors:

    http://www.yelp.com/search?find_desc=tailor&find_loc=San+Francisco%2C+CA&action_search.x=0&action_search.y=0

    You really want to ban this?

  18. Very interesting post.

    What happens to vertical search sites like Indeed or Oodle? Is Google going to ban their search results pages from the Google search results? (In a way, this is very similar to what Jeremy says.)

  19. Matt, the guidelines have not been updated in the Danish version – and probably not in most others either. And when we (the Danes) follow your link (or type it in) to the guidelines, we end up on the Danish page – not the English one you are seeing and linking to (yes, that’s what we call cloaking … no, sorry, personalization … no, I mean geotargeting … well, whatever hehe)

  20. Another good one…

    http://www.google.com/search?sourceid=navclient&ie=UTF-8&rls=GGLG,GGLG:2006-09,GGLG:en&q=burger+San+Francisco

    Google Onebox: burger king
    CitySearch: editorial list
    About.com: editorial list
    Yelp: “search” over user-gen

  21. Hi Matt,

    Just one question (sorry for my English, I’m French): I submitted an exclusion request to Google for one of my websites (ambv.free.fr), but it still remains in the Google index in spite of my robots.txt.
    For a little while it disappeared from Google, but it came back later.
    Does Google Search always respect robots.txt, or does it display results with excluded sites when they are the main web reference?

  22. Matt,

    I’ve got to tell you that this opens up, for me, a can of worms. If I’m to understand correctly, dynamic search results might be in danger of being removed from Google’s SERPs. My obvious problem with this is that I have sites using dynamic content such as /SearchResult.aspx?CategoryID=4 to produce a page for a particular category of products. Basically, this page (as the URL implies) is a search result of a query against my database. This was, and is, the promise of dynamic content.

    So this particular stance from Google is very concerning. If you’re not going to penalize sites like mine, and millions of others, the way that Vanessa Fox suggests, then I apologize for my concerns. But does this also endanger social bookmarking sites that use user-generated “tags”? These are also “search results” that provide a lot of content that I feel should be in Google’s SERPs. How about blog sites that “categorize” or “tag” content so that they provide links to modified search results, generating a page of content highly relevant to a particular term? Your own blog, for example, uses this technology. It’s a search result, no matter how you dress it. Is the difference in the URL? We can’t all implement human-friendly URLs, and we can’t make 1000s of pages by hand.

    And in all honesty, I think that some of the mentioned websites, like Yelp, are doing nothing wrong. If a “search results” page is highly relevant to a search term and ultimately serves the Google user, then what is the harm? Doesn’t the Google algo identify relevancy to a term? The YouTube search results page, I think, served searchers what they wanted: a page relevant to their term.

  23. Hi Matt,

    I wonder about the phrase:

    “Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines”.

    When is a page of no value to a user coming from a search engine?

    Let’s take the example on Search Engine Land (http://searchengineland.com/070312-104201.php) about the shopping.com DVD players. Those results are very useful for someone interested in buying a DVD player, or for someone just seeking information. Is it really such a good idea to ban those kinds of pages from the index?

    Grtz
    Joris

  24. Luis Alberto says;

    User-agent: *
    Disallow: /name of directory/

    If I want to block a specific directory this way, will it work for all search engine robots?
    Is the above text all I need to write in robots.txt?

    Where can I test my robots.txt file?

    regards

    Frank

  25. Hi Matt,

    I just have a quick question that is somewhat related to what Jeremy (CEO @ Yelp) has suggested. I help run a real estate web site, and for obvious reasons this might have a rather dramatic effect on what we are trying to do. We have used mod_rewrite to make clean URLs that automate our searches.

    “Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines”.

    To us and to our end users this does add value, much in the way Yelp’s results do.

    Any ideas?

    Thank you,

    Brandon

  26. I saw [site:zta.net] in Google and all the subdomains go to this URL > http://www.zfs.org/home.html

    Isn’t that spam? Matt, what do you think?

    Deb

  27. Quality guidelines are good to have. However, keep in mind that there are many platforms out there (such as Yahoo! Stores) that do not give you control over your robots.txt. In my opinion, Google should filter these results but not penalize a site for them.

    Saludos,

    Nacho

  28. Hey Matt, that’s a nice topic, and one I had recently thought about for our page. We did a recent relaunch and added a so-called “faceted metadata navigation”, which is simply a navigational search that lets the user narrow down the search results by selecting different facets (attributes). Looking at my logfiles, I saw that Googlebot is now crawling every possible permutation of our navigational search. I thought of adding a “noindex” to all of our search result pages, but besides the fact that I am lazy :), isn’t that something I would only do for a search engine and not for the user? There is value being added for the user, namely a convenient way of “exploring” search results just by navigating. Shouldn’t Google just ignore these similar pages instead of putting a penalty on the webmaster?
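
    If I do end up blocking them: our facets show up as URL query parameters, so a single wildcard rule in robots.txt should cover every permutation. Googlebot supports the * wildcard, and the parameter name below is made up for illustration (ours differ):

    User-agent: Googlebot
    Disallow: /*?facet=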

  29. Hi Mikkel,

    We are working to get the different language versions updated, and you should see the latest version shortly.

  30. There are Amazon search results being cached under the imdb.com domain:
    http://www.google.com/search?q=site:imdb.com/r/
    These are redirect links from imdb.com to Amazon that return a 302 status code, and Google is caching Amazon search results pages under these URLs. The 302 redirect links are a valid method of click tracking, but if Google did not follow them (and cache them), these “search results in search results” would not exist… nor would other problems associated with Google’s handling of such redirects.

    Physician, heal thyself.

  31. Hi Matt,
    While you’re cleaning up stuff on other sites :), you might want to pass along that the whois information for Google.ca is incorrect. At the very least the phone number is wrong. I called, and the lady was pleasant but distressed to be getting 30-40 phone calls a day for the Toronto office of Google.

  32. P.S. Any chance you could email me contact info for the PR folks at Google Canada? I need some giveaways; a phone call to the Toronto office got me an email address, but the email address bounced.

  33. You need giveaways? I didn’t know free stuff from Google was a life or death issue.

    Get that man 3 overpriced pens with about 5 lines’ worth of ink and the Google logo on them, STACKED!

  34. If such an approach is implemented, it will be the end of the aggregator sort of sites, and that would be a good thing, because I have often found posts from my own blog turning up as supplemental results after some aggregator site posted them in real time and was crawled first.

    Regards, George

  35. I’m late to this discussion, but I don’t see any real response to the quite valid points raised by Jeremy at Yelp, amongst others. Essentially these search pages are returning a list of relevant items based on what the searcher is after. The fact that the URLs haven’t been rewritten to say something like site.com/sanfran_burgers.html shouldn’t mean the site in question is penalized for this. Yes, there are clearly cases where this is used for spam, but I don’t see how that means search results from within a site should always be excluded using robots.txt.

    In fact any site that contains a list of products (like any e-commerce site with hotel listings, books etc. etc.) is essentially serving search results back to the user and these are indexed all the time (and rightfully so I believe). The only difference is they’ve created categories encompassing what the search engine kicks back.

    As a site owner, you have control over the search engine within your site and can create useful pages of aggregated search data that actually mean something to your users. I understand, based on Google’s webmaster guidelines, that this *should* be OK (since the pages are relevant), but then again, this is what YouTube was doing and those pages were removed?!

  36. Just to follow up on that: of course, if every time a user searches it generates a new ‘category’ and a link to that search query, then this would be a logical no-no.

    But if someone on an external site links to a search result (thus creating a link that a search engine can follow), one can only assume they are linking to it because the search results are actually relevant to what they are discussing.

    I’m sticking to the safe side here and adding a disallow for our search page, as I see it is being indexed (although we don’t get any traffic on these indexed pages, because we don’t change anything around the search results on the page, so they don’t rank well at all for the actual search query). But I would like to see some follow-up on this, as at the moment it seems to have been a rather impulsive addition to the guidelines.

  37. I agree with Luis Alberto. Technorati and other sites of that nature do not add any value to the search results either, in my opinion. They usually have only a few tags and no real information at all. If I cannot use this technique, they shouldn’t either.

  38. Scott, oh but you can.

  39. Heather Paquinas

    Things like this page from IBM can help with this problem. It searches for other documents (if there are any) based on your referring keyword.
    http://www.google.com/search?q=%22Modifies+a+document+based%22

  40. Say one had inadvertently allowed search results to be indexed – with a sloppy robots.txt file. Is simply amending the robots.txt the best way to remove these pages? Is that really all it takes? I have heard rumours that robots.txt isn’t always obeyed – or is that actually a mistake by the author (i.e. another duff robots.txt)?

    Also, could it be that indexed search pages from one’s site are actually causing duplicate content issues for the pages they represent? I’m thinking of product search pages causing problems for the ‘real’ product catalogue pages. Produced out of a database, they will share blocks of the same content (product descriptions), although not necessarily an identical match of blocks. Hope that makes sense?!

    Thanks for any advice…

    Steve

  41. Well, I’m dobbing in eurekster.com on this one. It annoys the heck out of me to see all their auto-generated pages flooding the SERPs (about 1.3 million showing right now). Essentially every search result is pushed into the index by making it a static URL with search results (probably scraped).

  42. So most bloggers use tags, which really are a form of search or auto-categorisation of content. Are we saying that this kind of on-the-fly categorisation could be deemed against Google’s quality guidelines? It is my understanding that as long as we slice and dice our content in a way that could be deemed useful to other users, then this is a good thing. Could the use of tags be deemed duplicate content, since content can clearly live in more than one place with a tagging system or any dynamic site? Or do the duplicate content guidelines mostly refer to cross-domain or multiple-site issues?

  43. “copies of websites via proxies” – nice words 😉

  44. I like coffee

    I have a solution for all those who want to keep the indexing in and those who want it out, and a way Google can make one more revenue stream… you all know where I am going with this, right?

    Yes, charge a dollar amount for your site index to be available in the search results. This does many things at once. First, it gets rid of all the people who are just toying with the system and not producing anything but leeching. Second, it provides legitimacy for the companies who are allowed to have their results show up. Third, user search is less crowded with useless garbage no one can really use. Last, Google has one more revenue stream.

    Yes Google, I want 1% of revs for this idea! Thanks.

  45. Hi,

    I think this makes sense for our site. I’ve just added a disallow for anything with the word “search” in it, so we’ll see if it makes our site look cleaner.
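
    For the record, the rule I added is just a wildcard pattern (Googlebot honors the *; note that a pattern this broad also blocks innocent URLs that happen to contain “search”, such as /research pages):

    User-agent: *
    Disallow: /*search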

  46. Ever googled for [black bean dip recipe]? The first result, from cooks.com, is an internal search result. So give us one reason why every site shouldn’t do such a thing and have a “RECENT SEARCHES” widget on every damn page?

  47. Thanks for the tip. I’ll start blocking search results in my ZenCart shops using robots.txt. I hope the default setups of popular open source packages start doing the same.
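
    In Zen Cart the search results come from index.php?main_page=advanced_search_result (assuming the stock URL layout – yours may differ), so the rule I’m planning is something like:

    User-agent: *
    Disallow: /*main_page=advanced_search_result

    (The * wildcard is a Googlebot extension; crawlers that only do prefix matching would need the full path, e.g. Disallow: /index.php?main_page=advanced_search_result.)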

  48. Cooks.com is still spamming search results with search results. Just google for rosemary potatoes. Don’t worry, you’re still ranking well for bacon polenta 😉

  49. Hello Matt,

    I have a query regarding Google’s debut of the “Search Within A Site” search box feature. My site has good backlinks and PageRank, but when I search for my site I can’t find this type of feature for it.

    Can you tell me what the criteria are for the search-within-a-site search box?

    Regards,
    Dipali

  50. I found this page looking for info on “cooks.com search spam”. Is Google finally going to do something about this website? Every single time I search for a recipe online, I hit this spammy Cooks.com website. Half the time the site isn’t active, and when I can actually access a page, it’s scraped search results.

  51. I completely agree with Jeremy CEO @ Yelp. If a site’s search results page has valuable, unique (i.e. not scraped from other sites) content that satisfies a searcher’s intent, then why would it be considered unworthy for Google search results?!

  52. Robots.txt files are very useful for the following reasons too:

    Saving bandwidth: Search engine spiders and other web robots often visit our web sites to index changes in our content. We can restrict access to directories that we do not want crawlers to reach.

    Cleaning up your logs: Once your web site is known to the search engines, the spider robots will look for your robots.txt file. Every time a robot requests this file and does not find one, it generates a ‘404 File Not Found’ error. Adding the file to your root directory will eliminate those errors from your logs (a minimal example follows below).

    Online protection: The search engine robots, unless instructed otherwise by your robots.txt file, will attempt to index as much of your web site as possible. This means that they may also access files on the server that they should not have access to.

    Avoiding unwanted indexing: If part of your web site is still being developed, you might want to prevent crawlers from indexing that content.
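
    On the logs point, even an empty rule set stops those 404s – a minimal sketch:

    User-agent: *
    Disallow:

    (An empty Disallow value blocks nothing; the file exists purely so the robots’ request for it succeeds.)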

  53. I’ve been wondering for the past two years: what’s the assumed value of having results from people search engines, aka scrapers, in Google’s SERPs? See e.g. http://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.123people.com%2Fs%2Fmatt%2Bcutts.

    Now, what’s Google’s stance on showing SERPs in SERPs when all the vertical results are expected to show up in Google’s own unified SERP anyhow?

  54. Dear Matt,
    What about classified personals or comparative items in internal search results?

    We have 2 debates (the ones that led us to your amazing and on-the-spot blog):

    Debate 1:
    Some people enjoy having a handful of results on one page. If you’re shopping for or comparing groups of items, potential partners, or baseball players, a results page is more appealing. You get to relevant info faster and you get to compare it.

    Other people argue that Google is that “comparative” list, and that they prefer to go to the individual ad.

    Debate 2:
    Another problem that arises from “search results in search results” comes from the dynamic nature of result pages. Let me illustrate: on Monday, result page #1 will display X, Y and Z, and Google indexes this URL (result page #1) as having that information.

    On Tuesday of the following week, result page #1 will have U, V and W; these pushed X, Y and Z to result page #2. The results Google indexed on result page #1 will have become stale, thus increasing the bounce rate on the page.

    Possible solutions:
    a) Put noindex tags on result pages (or see the robots.txt sketch below).
    b) Invert result pagination, giving each set of results a permanent URL that won’t change with time.
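
    If the result pages carry a page parameter in the URL (the name here is made up for illustration), a robots.txt alternative to (a) for Googlebot would be:

    User-agent: Googlebot
    Disallow: /*page=

    (Googlebot supports the * wildcard; a pattern this broad would also match any other URL containing “page=”, so treat it as a sketch.)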

    What do you think?

  55. Chris Marttins

    Hi Matt,

    Could you please tell me why Google doesn’t penalize the websites that index their own search results? I’m sick and tired of these kinds of websites, and the problem is that they appear in the top 3 search results in Google for many keywords, without any content whatsoever.

    Here is an example from google search results:
    http://www.google.com/search?q=site%3Acasapariurilor.net%2Fsearch

    40,000 indexed spam pages, and this is not a unique case…

    Thank you for your time,
    Chris

  56. The most popular classified ads site in the Philippines, sulit.com.ph, is spamming Google search results. When I search for certain keywords, sulit.com.ph results appear in the Google search results many times. Did you ever consider inspecting this site?

  57. Wouldn’t you want your search results to be indexed??? The bigger the site, the better???
