Duplicate content question

Someone recently asked me:

I read this overview of what you said at an SES conference:

Matt Cutts – Google Not prepared, but informal remarks. High order nits: what do people worry about? He often finds that honest webmasters worry about dupe content when they don’t need to. G tries to always return the “best” version of a page. Some people are less conscious. The person claimed he was having problems with dupe content and not appearing in both G and Y. Turns out he had 2500 domains. A lot of people ask about articles split into parts and then printable versions. Do not worry about G penalizing for this. Different top level domains: if you own a .com and a .fr, for example, don’t worry about dupe content in this case. General rule of thumb: think of SE’s as a sort of a hyperactive 4 year old kid that is smart in some ways and not so in others: use KISS rule and keep it simple. Pick a preferred host and stick with it…such as domain.com or www.domain.com.

From http://www.seroundtable.com/archives/003398.html

If this is an accurate summary, and I’m reading what you’re saying, then there’s no need to worry about duplicate content issues when submitting articles. Is that correct?

My response:

What I was saying was: I often get questions from whitehat sites who are worried that they might receive duplicate content penalties because they have the same article in different formats (e.g. a paginated version and a printer-ready version). While it’s helpful to try to pick one of those articles and exclude the other version from indexing, typically a whitehat site doesn’t need to worry about 1-3 versions of an article on their own site. However, I would be mindful that taking all your articles and submitting them for syndication all over the place can make it more difficult to determine how much the site wrote its own content vs. just used syndicated content. My advice would be 1) to avoid over-syndicating the articles that you write, and 2) if you do syndicate content, make sure that you include a link to the original content. That will help ensure that the original content has more PageRank, which will aid in picking the best documents in our index.

We use additional heuristics of course, but I figured other people might want to hear that take.
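
The "pick a preferred host and stick with it" advice from the conference summary can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `PREFERRED_HOST`, `HOST_ALIASES`, and the function name `preferred_url` are hypothetical, and the idea is that a site would redirect any request for an alias host to the URL this function returns. It says nothing about how a search engine chooses between copies.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical values: the host the site owner has chosen as preferred,
# and every host name that serves the same content.
PREFERRED_HOST = "www.domain.com"
HOST_ALIASES = {"domain.com", "www.domain.com"}


def preferred_url(url: str) -> str:
    """Rewrite a URL on any alias host so that it uses the preferred host."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is already lower-cased
    if host not in HOST_ALIASES:
        return url  # not one of our hosts, leave it untouched
    path = parts.path or "/"
    # Fragments never reach the server, so they are dropped here.
    return urlunsplit((parts.scheme, PREFERRED_HOST, path, parts.query, ""))


if __name__ == "__main__":
    print(preferred_url("http://domain.com/article?page=2"))
```

Whichever host is chosen, the point is consistency: internal links, syndicated link-backs, and redirects would all point at the same form.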

80 Comments »

  1. Michael Said,

    February 1, 2008 @ 7:51 am

    I have a poetry blog. I uploaded a couple of new poems lately, and I included links to the original content in my RSS feed, like you had suggested, because another service was scraping my feed. It did no good, though. The other service now ranks for my poems, and the original content cannot be pulled up in Google.

  2. Omar Said,

    February 1, 2008 @ 7:52 am

    In this new world of distributed platforms and no “central” web, syndication is an integral part of any content website’s growth. I know that this idea has been floated before, but having a “no-index” tag around content would go a very long way.

    In the current situation you have two choices:
    1) Use a javascript delivery mechanism
    2) Link back to the original article

    Option 1 has issues because you might want to do some data processing on the delivered data. You can get around this by using a staged process mechanism but that’s just a major pain. Plus some people don’t have javascript enabled, etc etc.

    Option 2 has major issues for the site using the content since they’ll be penalized for duplicate information.

    Therefore having an easy “no-index” tag that we can wrap around content to indicate to the search engines to not read specific text would be great. I’m sure this introduces other issues for YOU guys since it can be open to abuse, but that’s why you guys hire the smart PhDs :)

  3. Jean-Noël Anderruthy Said,

    February 1, 2008 @ 8:09 am

    “make sure that you include a link to the original content” Are you sure that Google takes this into consideration? I have some examples that seem to prove otherwise …

  4. Shahid Said,

    February 1, 2008 @ 8:11 am

    Hi Matt,

    Thanks for the clarification.

    With regards to the link back, for example :

    Would you recommend including the link-back on a print friendly page also?

    i.e. View this page online at : xyz.com/some-article/

    Thanks

    Shahid

  5. EGOL Said,

    February 1, 2008 @ 8:29 am

    A lot of sites publish articles on blogs, then large aggregator sites grab them and publish them verbatim or just the content of the feed. These secondary publishers often outrank the original source in the SERPs. I think that a way to identify the source website is needed so that original content is rewarded. Maybe a service such as feedburner could be used to identify the true author?

    Also, if you have the good fortune to have an article make it to the homepage of Slashdot or to Digg that will result in a large number of copies of your content grabbed and published all over the web. Even if that article appears first on a rather powerful site it will be quickly outranked by large scrapers and unauthorized republishers on queries for the exact article title.

    I’d like to see the search engines find more ways to give first publisher the best rankings. It would also be best for search engine visitors because they click straight to the author’s website. This will encourage publication and discourage dupe content entry into the index.

  6. Greg Said,

    February 1, 2008 @ 8:36 am

    Matt,

    What effect does having the same listing data on thousands of real estate sites have? Does it devalue just those pages where the data appears or could there be a penalty to the home page because of the duplicate content?

  7. mark Said,

    February 1, 2008 @ 8:39 am

    Thanks for the tip Matt.

    On a somewhat related note…

    What about entire site copies? By “entire” I mean it’s just a one page site of mine that is copied on about 50 other sites b/c people love what I do. lol

  8. suresh Babu Said,

    February 1, 2008 @ 8:44 am

    Thanks Matt.

  9. Kyle Said,

    February 1, 2008 @ 8:48 am

    What about tagging? Organizing the same articles in 38923893 different categories…and in the process, targeting 3829239q122zomg293 keywords than you “normally” would.

  10. Veign Said,

    February 1, 2008 @ 8:49 am

    Yahoo provides a robots-nocontent class tag that can be used to remove content from the page flow from being indexed (or used in determining a page’s weight). Does Google support this tag? If not, are there plans to support such a tag?

    Reason is headers, footers and even some page content may be repeated throughout a website and it would be nice to force an exclusion from any duplicate content penalties.

  11. chester Said,

    February 1, 2008 @ 9:03 am

    Thanks for clearing that up Matt. I figured as much but wasn’t 100% sure. Now onto bigger and better things.

  12. Emmanuel Said,

    February 1, 2008 @ 9:06 am

    Thanks for the clarification Matt, it makes sense to put the original URL inside the syndicated article.

    The second point (duplicate content for owner of .com and .fr) is very interesting and is worth an explanation I feel. If someone has website.com AND website.fr and put some part of the same genuine content on both, he will not be penalized?

    The other thing I wonder is that how Google assign an article to be original?
    The website that is older in the G index is set as “original” source and Google track duplicates?
    Or is it only by report of the user (DMCA complaint)?
    Or both…

    Thanks

  13. Nino D. Said,

    February 1, 2008 @ 9:25 am

    Hi, Matt,

    In your experience does Google appreciate it more if a web site has fewer but only valuable pages, or does it prefer a website that has 1.000s of pages with no real value content but leading to valuable pages?

    And second question is regarding
    Does this tag tell Google:
    Follow links to important pages and take this page into calculation of links, but do not use them for search results.
    Also do not count page as possible duplicate content.

    Only web pages tagged with content=”index,follow” consider for results from my web site.

    Thanks.

  14. Nino D. Said,

    February 1, 2008 @ 9:29 am

    (some text was missing from previous post)

    ….valuable pages?

    And second question is regarding: meta name=”robots” content=”noindex,follow”
    Does this …

  15. youfoundjake Said,

    February 1, 2008 @ 10:01 am

    Off topic, but please tell us you’re gonna address Microsoft’s bid for a 45 billion buyout of Yahoo. Personally what do you think? I know as a company man, you can’t go near it, but personally?

  16. Max Roeleveld Said,

    February 1, 2008 @ 10:25 am

    There is a different, but somewhat important, side of duplicate content on your site — this will apply to most blogs, but maybe other sites as well. It affected me, anyway.

    I run a WordPress blog, with all the archive pages that come with it. By default, WP will use the full content (or at least up to a more-tag) for those pages. That means that an article might appear in several places, especially when it’s recent:

    - the front page
    - the yearly archive (if you use the semi-standard /archive/year/month/day/postname permalink structure)
    - the monthly archive
    - the day archive (same thing as with year)
    - one or more category archives
    - one or more tag archives
    - the single post-page

    Since archive pages tend to receive way more links than single post pages, that might result in the archive pages outranking the single post pages. That’s what happened to me, anyway. It means that new visitors (arriving from some search) might land on a long (archive) page, with the relevant (for them) content somewhere at the bottom. They may scroll, they may not. Most will not, I fear. This means you’re losing potential readers because of duplicate content on your own site.

    Not exactly the same as the duplicate content problems you mentioned, but it’s still a problem. Of course it would be kinda impossible for google to fix this without adding loads of heuristics. This is a case of rather poor navigation design. And it’s not WordPress-only, lots of blog apps do this. The only way to change this is by making sure single-post pages receive more link love than archive pages, cutting the amount of archive pages, and not holding back on the more-tags and excerpts.

    More on-topic: it should be possible to “backtrace” the origin of a digg by following the backlinks until you can’t go no further. This is most likely the origin, especially if several backtrace routes end up at the same point. But for all I know, you guys may already be doing this. =]

  17. Patsy Sermersheim Said,

    February 1, 2008 @ 10:49 am

    Matt- PLEASE HELP! I have not received email for two days nor has the guy who set me up or any of his other clients. We cannot find a live person anywhere at Google to help us with this problem. I have spent two days answering the phone from people telling me my address has “permanent errors”. I was calm yesterday KNOWING it would be fixed today- but not only is it not fixed I am afraid that Google isn’t even aware of this problem!

    You can’t email me- my number is 615 509 7413

  18. Amit Bhawani Said,

    February 1, 2008 @ 11:01 am

    Why not stop worrying about duplicate content, when you know you are not having any real duplicate content. If you are worried about content duplicated in your own blog/website why not add a nofollow to these print pages, tag pages etc. Having a [] in your index.php, archive.php, tags pages, search result pages etc in order to display just a small summary of your article and all of these linking back to the full article page. This way you can guide the search engine bots clearly to the full article because in my opinion, the sitemaps are something which was used 2-3 years ago and now the SE bots are going to index your pages within minutes of posting them.

    Another important point is domain name/website authority, which depends on many factors, like the age of the domain/website, relevant backlinks, and content that is useful both for visitors and search-engine-led visitors; such sites rank better in SERPs than new websites. In this case, even if an authority website copies from a new website, it still ranks at the top. An example is an article submitted to ezinearticles.com and also posted on your blog: obviously ezine’s article page would rank #1.

    Originally the point matt has given above makes sense but i feel there are lot more factors related to this Duplicate content and their ranking factors.

    Regards
    Amit Bhawani
    http://www.amitbhawani.com/blog/

  19. John Jones Said,

    February 1, 2008 @ 11:34 am

    “If you do syndicate content, make sure that you include a link to the original content. That will help ensure that the original content has more PageRank, which will aid in picking the best documents in our index.”

    Based off of that statement Matt, are you saying that even though the original content was added to my site and indexed by Google first that a site that syndicates content could outrank my page for even the simplest term such as the article title?

    What about scrapers that have “more PageRank”; are you telling us that even though I’m the original author and my site had it first that some site can come along and take my content and push my site down?

    Amit Bhawani…

    Duplicate content is an issue for many people simply because we cannot always control who uses our content and who doesn’t. No Follow is helpful but that is only on our end.

    One of my co-workers suggested copyrighting and that is always an option but when you’re dealing with small business owners who don’t have the time, money or resources to combat any number of scrapers then copyrighting becomes illogical.

  20. Michael Brandon Said,

    February 1, 2008 @ 1:42 pm

    If only your checking of PR and “additional heuristics” would work properly! There seems a lot of Google Collateral damage in this area.

    Why is it that I am seeing many sites disappear from top Google rankings?
    1 - less powerful sites go to 100+
    2 - more powerful sites can just drop from 1st to 10th

    I then check for other sites duplicating the meta description and words including one before and after the main search phrase. After finding many scraper websites, I then change the text on my clients pages, and find their rankings:
    1 - come back from the dead the moment that Google recaches the pages
    2 - slowly come back over weeks after having made the changes
    3 - some seem to still have the “penalty” attached to them, despite the change in wording.

    This aspect of the job is taking a good deal of my time.

    My article on this - Change your pages wording frequently

    I did an experiment where I copied an article word for word on another of my sites: Duplicate content experiment. From the 18 November 2007 till now there has been a tortuous journey of the pages being decached, the copy showing and my page being ditched to only now, several months later, the correct original page being shown.

  21. Search◊ Engines Web Said,

    February 1, 2008 @ 3:22 pm

    Sometimes - due to a bug in Google - an older page will in fact be deindexed due to duplicate indexing issues if a copy - for whatever reason - gets prioritized on the SERPs.

    This has happened in the past and continues to happen. It could be a backup domain with the same pages duplicated, or even someone replicating a page.

    The duplicate page sometimes takes the place of the original web page on the SERPs and even replaces it in some of the keyword rankings.

    Sometimes this will take months to self heal - sometimes, years. Also there is a NEW algo threshold of what is considered a duplicate page - it no longer has to be 100% or close - it appears the new algos have lowered the threshold to the majority of a page being duplicated or having duplicate information or SHARDS .

    blog.searchenginewatch.com/blog/060313-090116

    Sometimes this will appear on some datacenters but not others.

    BTW:

    What is your take on the Microsoft Yahoo merger?
    It is frustrating to want to talk about it - but be stymied by corporate policies.
    This would be a perfect forum to debate it.
    Perhaps Google should place its bid for Yahoo :-D

  22. Franky Said,

    February 1, 2008 @ 4:23 pm

    How can I solve this problem with duplicate content if it’s used more than 3 times? All links point to the homepage, but the parameter A_ID is a value that identifies our affiliate partners and it’s used often.

    http://www.site.top/index.php?A_ID=1111
    http://www.site.top/index.php?A_ID=2222
    http://www.site.top/index.php?A_ID=3333
    http://www.site.top/index.php?A_ID=4444
    http://www.site.top/index.php?A_ID=XXXX

    greetings
    Franky
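
One possible way to approach the affiliate-parameter situation described in this comment, in line with the general advice to pick one preferred URL, is to compute a single canonical address by dropping the tracking parameter. The Python sketch below is an assumption-laden illustration, not a confirmed recommendation from the post: `TRACKING_PARAMS` and `canonical_url` are hypothetical names, and it assumes the site records the `A_ID` value server-side before redirecting visitors to the canonical URL.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical: query parameters that identify an affiliate but do not
# change the content of the page.
TRACKING_PARAMS = {"A_ID"}


def canonical_url(url: str) -> str:
    """Return the URL with content-neutral tracking parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


if __name__ == "__main__":
    for affiliate_id in ("1111", "2222", "3333"):
        print(canonical_url(f"http://www.site.top/index.php?A_ID={affiliate_id}"))
```

With this approach every affiliate link resolves to the same address, so only one version of the homepage is presented for indexing, while parameters that do change the content are left in place.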

  23. Bean Said,

    February 1, 2008 @ 4:27 pm

    I am currently mid-way through releasing a 10 part series of articles through various article syndication sites.

    A lot of time and effort has gone into producing 10×500 words of interesting reading that may entice other websites to republish my articles (and yes this is purely for linkage purposes from relevant and themed sites…but it’s their choice based on the quality of my content).

    I have not placed the articles on my own site for fear of a penalty, as the syndication sites often get indexed long before mine - shortly followed by scraper sites. A waste of content, but hey ho stuff ‘appens

    Is duplicate content really a penalty or is duplicate content simply ignored?

  24. Omar Khan Said,

    February 1, 2008 @ 5:11 pm

    On the dupe content issue may I ask this general question -

    A 12-year-old, high PR site has 100 pages and images of content (out of 2,000 total). It wants to put the same images and captions on a new domain which adds more functionality to the images (send, buy, save, tag, etc.) and also offers other display options (slide shows). The text, image names and file names (but not domain of course) are the same.

    Does the second new site need to robots.txt out the search engines for fear of hurting the first one?

  25. Dave Said,

    February 1, 2008 @ 5:15 pm

    so would this mean it is not wise to write an original article and post it to 50 article sites along with having that article on your site?

  26. Robert Truog Said,

    February 1, 2008 @ 5:39 pm

    Matt,
    So, am I correct in assuming that creating a mobile version of my website …which would, of course, be an entirely duplicated site, will not create problems. Or should I use “do not index” tags on the mobile version of the site.
    Thanks.

  27. Dave (original) Said,

    February 1, 2008 @ 6:42 pm

    I have never understood why anyone would give away their own site content to another site. Likely started when some “SEO/forum guru” suggested article syndication as SEO.

    Seems to me that scraping site content is problem enough.

  28. Amit Bhawani Said,

    February 1, 2008 @ 7:00 pm

    @ John Jones : If you have a fear of others copying your content, read this guide on how to copyright your content. This way you need to first email the copycat website owner’s webhost and send them complete information proving you own the original content, and get them to remove those articles from their website or else allow them to link back.

    Edit : No idea why every other commenter asks a non-related question @ yahoo-msn merger, don’t you guys think matt would have made a post if it’s something worth discussing? :)

    Regards
    Amit Bhawani

  29. SharpSEO Said,

    February 1, 2008 @ 9:25 pm

    Thanks for clarifying the syndication thing. I still talk to lots of people who think syndication can do no wrong. But I’ve seen sites that get outranked fairly consistently for their own syndicated content, even with links in place.

    Shahid - You shouldn’t need to include a link on the printer-friendly page, but it might help if people scrape your pages. It’s best to stop having it indexed altogether, using robots.txt or nofollow. Then you don’t have to worry about dup issues internally.

    If people are linking to your printer-friendly pages instead of regular ones, you could lose some link from not having them crawled. But that should be rare.

  30. GSL Said,

    February 1, 2008 @ 10:03 pm

    Suppose Two sites A and B.

    Now B is using some content that is first published in A. But B is also offering link back to A indicating it as original…..here they played trick and put nofollow tag following link to site A.

    Now the question is, will Google penalize B for this or Not????

  31. Ryan Said,

    February 1, 2008 @ 10:34 pm

    I used to syndicate myself all the time, double posting articles on dotcult.com and shoutwire.com. Now I simply just include an RSS feed of my posts from one site on the other, and if it’s a bigger post just summarize it and link to it from one of the other blogs.

    I’ve found it works much better to do a new blog post on one that summarizes, references, and links to the other. No penalty, and users start actively trying to read both the full post and the summary and commenting on both.

    SEW, Google sent out a notice telling all employees not to comment on the MS / Yahoo thing either officially or unofficially.. so I don’t think Matt is allowed to.

  32. Goran Giertz Said,

    February 2, 2008 @ 1:13 am

    Hi Matt

    This all confuses me.

    Content is the heart of any website but that is not enough to rate well with Google we need to find quality contextual links to point to us.

    As we choose not to purchase links the next best alternative is to share our content with other credible industry websites. But due to duplicate content, in many instances, we will lose our content value.

    What are your recommendations?

    Have an exSEOllent 2008

  33. Mike Said,

    February 2, 2008 @ 7:38 am

    So, Matt, is PR what actually defines which one is the original article? Don’t bigger PR sites gain an unfair advantage because of this?

    Thanks,
    – Mike

  34. Anthony Lawrence Said,

    February 2, 2008 @ 9:02 am

    “I have never understood why anyone would give away their own site content to another site”

    Well, some of us from the open source world thought it was the “right thing to do” - if code should be freely shared, so should content. So, originally, I licensed my stuff under a Creative Commons license and thought that the little guy who needed a couple of articles would be helped, or if someone needed something for a newsletter..

    Yeah, right: the reality was very different and I quickly changed my mind about that. And it actually makes sense: code executes, web pages are read. They aren’t the same, and there is no reason you need to physically copy my content: link to it.

    If someone does want something for a newsletter or some other special case, I am happy to give permission, but they do have to ask now. No more free copying.

  35. Lee Said,

    February 2, 2008 @ 9:59 am

    perhaps a solution for dup content is a registry - URL, article title, date & time stamp. First to register is original author, all else is dup content

  36. Hagrin Said,

    February 2, 2008 @ 12:02 pm

    Matt -

    On the topic of content duplication, I was wondering if you could pass along the following suggestion to the Webmaster Central team -

    Add a “Last Updated On” date to the Diagnostics -> Content Analysis page.

    After going through the posts on Google Groups, it seems that a few webmasters were curious as to the update frequency of that page since they corrected identified issues yet they still display as problems.

  37. Zul Said,

    February 2, 2008 @ 1:57 pm

    Yeah, I know about the same-content penalty and how it could get you de-indexed… But is it a REALLY big deal that we need “to try avoid making duplicate-content at all cost”?

  38. Multi-Worded Adam Said,

    February 2, 2008 @ 3:04 pm

    SEW, Google sent out a notice telling all employees not to comment on the MS / Yahoo thing either officially or unofficially.. so I don’t think Matt is allowed to.

    At the risk of sounding like a smartass, wouldn’t that include talking about the notice itself?

  39. Dave (original) Said,

    February 2, 2008 @ 5:37 pm

    Well, some of us from the open source world thought it was the “right thing to do” - if code should be freely shared, so should content. So, originally, I licensed my stuff under a Creative Commons license and thought that the little guy who needed a couple of articles would be helped, or if someone needed something for a newsletter.

    IF your aim is to expose the content to as wide an audience as possible, why worry which one Google picks?

    I still say the vast majority erroneously give away their content because they THINK it will help with “SEO”.

  40. mybuks Said,

    February 2, 2008 @ 6:23 pm

    Hi Matt, although I subscribe to this feed this is my first time posting a comment. Thanks for all of your good advice!

    I have an issue regarding duplicate content. I am not an “expert” per se, but I am learning from every resource I can find. Here’s the situation. I have been using Blogware (Blogharbor) for a few years now. I have indexed quite high in some keywords and am getting some good daily traffic from the blog itself. I have now taken all of my blog entries and uploaded them to my own domain (…./blog). The concern I have is that now I have identical articles posted on blogharbor and on my blog (/blog). I am not sure how I should work this as far as duplicate content and my previous pages that are indexed with Blogharbor. I don’t want to lose the traffic, but don’t want to be penalized for duplicate content. Should I cut my losses and cancel Blogharbor or deal with the penalty from Google (and the other SE’s) for duplicate content. Any bit of help on this subject would be appreciated.

    Thanks!

  41. Nirmik soni Said,

    February 3, 2008 @ 12:15 am

    Why not stop worrying about duplicate content, when you know you are not having any real duplicate content. If you are worried about content duplicated in your own blog/website why not add a nofollow to these print pages, tag pages etc. Having a [] in your index.php, archive.php, tags pages, search result pages etc in order to display just a small summary of your article and all of these linking back to the full article page. This way you can guide the search engine bots clearly to the full article because in my opinion, the sitemaps are something which was used 2-3 years ago and now the SE bots are going to index your pages within minutes of posting them.

  42. Jean-Noël Anderruthy Said,

    February 3, 2008 @ 3:37 am

    There is a good example on this page : http://www.techwebmedia.com/2008/02/03/google-rank-spammer-who-steal-blog-content-above-authors/

  43. Justin Said,

    February 3, 2008 @ 9:19 am

    Just Feedback! I wanted to drop a note or feedback for Google on a policy that I don’t agree with. I can remember that Ebay and Amazon both disappeared from your search results for something googlebot didn’t like and disqualified them for the results. Google fixed the problem within a couple of hours. Yet, when a regular webmaster like myself has a problem with googlebot it has taken almost 3 years to gain back trust from google and still no results. I have done and followed all your steps in your guidelines, read webmaster groups and blogs. I also have talked with pros on webmaster sites and none can find a problem with my business. It has been 3 years. I can understand that amazon and ebay get a lot more traffic and may affect more employees. However, my business disappearance affects me as much if not a whole lot more than them in comparison. How come google acts like this? If this was an equal Internet or policed and fair your company wouldn’t get away with this. All I am saying is that if you can fix them in hours, how come I can’t get fixed in years, with the help of pros from well respected seos?

  44. Lid Said,

    February 3, 2008 @ 12:53 pm

    Thank you for this Matt, I still have a question that I hope you can help with.

    Not long ago, I moved a wordpress.com blog to GoDaddy using WordPress software.

    The problem I came across was that WordPress.com doesn’t offer 302 redirects, only 301.

    As I’ve been building this blog for over a year now, and although there are not many links in, I still did not want to lose the links that we did have. As a result we used the 301, but it bothers me a bit.

    Does this now pose a problem with duplicate content? If it does, I would love any suggestions/advice if you have the time.

    As the move itself was quite tricky, we compiled a ‘how to’ document for other people wanting to move from wordpress.com to self hosted.

    The only area we didn’t cover was duplicate content as we were not sure how it fits in, and did not want to offer advice that was incorrect. If you have time, I would love your take so we can include it.

    Thanks again for this post!

  45. Chris_D Said,

    February 3, 2008 @ 4:48 pm

    Dave (original) - the vast majority of ‘duplicate’ content is syndicated news content from Reuters etc. Every news organisation on the planet, thousands and thousands of articles / day.
    Nothing to do with “SEO”

  46. Dave (original) Said,

    February 3, 2008 @ 6:44 pm

    Chris_D, yes I know, your point?

  47. IncrediBILL Said,

    February 3, 2008 @ 6:46 pm

    Matt, I remember a conversation we had a while back at the Google Dance @ SES about duplicate content issues and it sounds like there are still issues if you need a “link back” to establish ownership.

    Does this mean that scrapers can still get top billing for your content? Because they certainly don’t give links.

  48. Dave (original) Said,

    February 3, 2008 @ 9:25 pm

    If *your original pages* are being outranked by a scraper site, you have much bigger issues to worry about, IMO.

    My site pages are frequently scraped, yet not 1 of the scraped pages outranks my originals. Google is pretty good at ensuring only the original ranks, unless one constantly and frequently syndicates their content.

  49. Matt Cutts Said,

    February 3, 2008 @ 10:09 pm

    With regards to the link back, for example :

    Would you recommend including the link-back on a print friendly page also?

    i.e. View this page online at : xyz.com/some-article/

    Shahid, I would pick one version of your article that you want to be preferred (search engineers at Google call it “canonical”) and point everyone to your preferred url for your content.

    Yahoo provides a robots-nocontent class tag that can be used to remove content from the page flow from being indexed (or used in determining a page’s weight). Does Google support this tag? If not, are there plans to support such a tag?

    Reason is headers, footers and even some page content may be repeated throughout a website and it would be nice to force an exclusion from any duplicate content penalties.

    Veign, we don’t currently support that tag, for a couple of reasons. We think we do pretty well on detecting boilerplate (e.g. you’re not likely to run into any issues of duplicate content for header/footer type stuff). The other reason is that we haven’t seen a lot of sites using the tag after Yahoo mentioned it. Given the choice on where to put engineering resources, not a ton of people have asked for this feature.

  50. Matt Cutts Said,

    February 3, 2008 @ 10:18 pm

    Emmanuel, normally a .com vs. a .fr would have French for the .fr and English for the .com, and in that case there’s almost no way that duplicate content would be an issue between the two sites.

    Max Roeleveld, good points. The recent update in WordPress (2.3) does much better about uniting url aliases under one url. But any software package that allows monthly/daily/yearly archives, tags, etc. will always run the risk of having content appear under different urls. My guess is that over time, both Google and WordPress will get better about such issues.

    Patsy Sermersheim, I don’t have the cycles to contact everyone who is having issues with their site. But Googlebot is designed to handle temporary issues (such as a web server being down) pretty well. If you can reach your web site with your browser, Google can usually crawl it. And if you can’t reach your web site with a browser, that’s the issue I’d concentrate on.

    John Jones, every search engine is going to employ heuristics to try to find the best copy of content. For the most part, those heuristics work well and pick the best copy, but there are definitely steps you can take that make the decisions easier for search engines.

    Omar Khan, I would try to make sure that the newer site has enough new information/copy/details that it’s clearly different from the older site.

    Lid, it sounds like the blog has well and truly moved to a new location. Since it’s not a temporary/transient move, using the 301 sounds perfectly appropriate to me.
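For readers who have not set one up before, a permanent move is signalled with HTTP status 301 plus a `Location` header, while 302 marks a temporary one. Below is a minimal sketch using Python's standard `http.server`; the new address `new-blog.example.com` and the names `redirect_location` and `PermanentRedirect` are assumptions for illustration, and a real site would normally configure this in its web server or blogging software instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical new home of the blog.
NEW_BASE = "http://new-blog.example.com"


def redirect_location(path: str) -> str:
    """Map a path on the old site to the same path on the new site."""
    return NEW_BASE + path


class PermanentRedirect(BaseHTTPRequestHandler):
    """Answer every GET or HEAD request with a 301 to the new location."""

    def do_GET(self):
        self.send_response(301)  # 301 = moved permanently (302 would mean temporary)
        self.send_header("Location", redirect_location(self.path))
        self.end_headers()

    do_HEAD = do_GET


def make_server(port: int = 8000) -> HTTPServer:
    """Build (but do not start) a local server that issues the redirects."""
    return HTTPServer(("127.0.0.1", port), PermanentRedirect)
```

Calling `make_server().serve_forever()` would start it locally, and requesting any path should then return a 301 whose `Location` header points at the same path on the new host.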

  51. Matt Cutts Said,

    February 3, 2008 @ 10:24 pm

    Hey all, my parents came in on Friday and are visiting for a few more days. My Mom wants to kick me off my laptop to catch up on her email? Can you believe that?! I’m all like “Why didn’t you bring a laptop with you?” and she’s all like “I gave birth to you and took care of you for years and raised you well” and in truth, that probably trumps anything I could say. Maybe I should get her some flowers, too. :)

    So I’ll be a little scarce this week on the blog.

  52. Lid Said,

    February 4, 2008 @ 12:09 am

    Hi Matt

    Thank you for your response, it is very kind.

    However, I’m a huge dope and I hope you will not hold that against me.

    I got the numbers back to front - turns out WordPress only offers a 302 redirect (temporary) to a new domain - not a 301 (permanent), even though it is a permanent redirect. (Had to write that so I don’t feel I’m going sillier than I am).

    I am sorry for the error, I’m taking huge amounts of antibiotics and pain killers for a tooth ache - and I really hope this is the reason I am dopey today.

    I hope you will take some pity on this person that needs your advice.

    Finally, having little people myself, I have to tell you - give your mom your laptop! And forget the flowers - think LOTS of hugs!

  53. Nick-search Said,

    February 4, 2008 @ 2:05 am

    I have recently been involved in article writing for my customers. This piece of information is excellent, as I have been worried about duplicate content issues.

    I have been using some of the popular article submission sites such as ezine, article hut etc (well, here in the UK), and have sometimes been submitting the same or a similar article to a blog on my customer’s site.

    As this creates two separate RSS feeds, one from the article site (which supplies the author with a feed) and one from the blogging software installed on my customer’s web site, will this affect the way that G interprets the text?

    Keep up the good work!

    Nick

  54. Robert Said,

    February 4, 2008 @ 6:18 am

    Hi Matt,

    I have a large website with thousands of different pages. We are currently targeting the global audience, but want to specifically target a certain segment. In terms of duplicate content, can we have exactly the same page (in the same language and format), but just put it on a different domain, i.e. .co.uk, .com, .co.nz, .co.au etc etc.? Then can we geo-target each different domain to the country?

    If you reply I can give you some further information about this point.

    Thanks

  55. Emmanuel Said,

    February 4, 2008 @ 6:53 am

    Hello Matt, good point for .com and .fr, even though sometimes some English content is pushed over to the .fr because the translation is not done yet.

    The question of duplicates remains for website.com and website.co.uk domains (both English content).

    No! do not tell me that we have to write the first content in American-English and translate it for the second site in UK-English :)

  56. Dyna Said,

    February 4, 2008 @ 7:50 am

    Hi,

    Can someone point me to where I can read more about duplicate content issues related to a web page and a stripped mobile version of that page?

    - example.com/article-1.html (web page; with lots of stuff)
    - example.com/mobile/1.xhtml (mobile page; only main content)

    I wanted to know -

    - how Googlebot accesses these pages
    - how these pages are indexed
    - whether to submit sitemaps for both in Google webmaster tools

    In my earlier experience, I have seen mobile content mixed with normal content in Google search index and that caused landing on the wrong page, i.e. stripped mobile version of the page (and that caused fear of being penalized because of dup content !!)

    I am sure I will get the right answer here…

    Thanks & regards

  57. Emmanuel Said,

    February 4, 2008 @ 8:08 am

    Matt…. here should be your motto: “your mum is ALWAYS right” :)

  58. Harith Said,

    February 4, 2008 @ 8:59 am

    Matt,

    “Maybe I should get her some flowers, too”.

    You go out and buy your Mom a new laptop and keep it at your home for her usage when she is on visit. Because:

    “I gave birth to you and took care of you for years and raised you well” ;)

  59. crexatalyst Said,

    February 4, 2008 @ 10:22 am

    Duplicate content can put both parties at risk.

  60. mr Lhasa Apso Said,

    February 4, 2008 @ 12:14 pm

    Thanks Matt for clearing up this issue. I used back-links to my articles, which was the right thing to do.

    best regards

    Frank

  61. Omar Khan Said,

    February 4, 2008 @ 3:33 pm

    Thank-you Matt, I will robots.txt the stuff out. I do think though that Google should find a way not to penalize duplicate content in another presentation form - especially images and captions - and where the two sites link to each other and acknowledge a relationship. But I understand the complexities of doing this in an automated fashion. I appreciate your response.
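As a rough illustration of "robots.txt-ing the stuff out", the sketch below uses Python's standard `urllib.robotparser` to check which URLs a given robots.txt would block. The `/gallery-copy/` path, the `example.com` URLs, and the helper `is_crawlable` are hypothetical; the actual paths on the sites discussed are not given in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the newer site: keep crawlers out of the
# section that repeats images and captions from the older site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /gallery-copy/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url: str, agent: str = "Googlebot") -> bool:
    """Return True if the rules above allow `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/gallery-copy/photo1.html",
        "http://www.example.com/gallery/photo1.html",
    ):
        print(url, "->", "allowed" if is_crawlable(url) else "blocked")
```

Checking a handful of representative URLs this way before publishing a robots.txt can catch a rule that blocks more (or less) than intended.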

  62. Summit Said,

    February 4, 2008 @ 8:05 pm

    Google is good but not all that good at picking up dup content. I have seen examples of only slightly changed work indexed for very similar searches.

  63. Deb Said,

    February 5, 2008 @ 1:35 am

    Thanks Matt for your valuable information, and long live Matt-MaMa

    Deb

  64. Michael van Helden Said,

    February 5, 2008 @ 1:55 am

    Matt,
    One of our clients has multiple domains. Most of them are 301 redirected to the main .com domain. For the Dutch language we use a .nl domain, which is not 301 redirected in any way. Since we use another language, I don’t think we will have duplicate content issues.

    But we also use a .be domain (Belgium, also with Dutch content). We don’t redirect the .be or .nl domain to one another, because we would like to rank high in google.be with the local .be domain, and rank high in google.nl with the local .nl domain.

    Do you think this is the best setup?
    I appreciate your response.

  65. NotHappyJan Said,

    February 5, 2008 @ 2:23 pm

    I’ve currently got a big issue with my site and the consensus after being reviewed by dozens of top SEO’s and Webmasters is it’s due to either duplicate content or a glitch by Google.

    My site was getting 15,000 visitors a day, from some 100,000 unique search terms per month and 90% of traffic derived from Google.

    On the 26th Jan in a matter of minutes traffic dried up to basically zero, all Google rankings vanished… Many thousands of page 1 positions there one minute gone the next.

    I do syndicate content, and all syndicated content contains a link back to my site as instructed. My site has PageRank 5/4/3 pages, yet now when searching for anything from my site I get WordPress “splogs” coming up on page 1 and I’m not in the top 500 results.

    These “splogs” generally have PR0 Homepages and PR N/A pages my content is on and more advertising jammed in to their sites than you can poke a stick at.

    Even searching for “My Content Title - My Domain Name” brings them up on the first page and I’m back on page 4 of 4.

    How is this a good user experience?

    My site didn’t buy and sell links, complied with the Google Webmaster guidelines, was monetized by Google Adsense, and was working to one day become a Premium Publisher. The site was my only source of income, as I recently quit my job to concentrate on developing my site so I could work at home and look after my special needs daughter.

    What do I do? Pollute the web and join these automated “splogs” instead of doing things the right way? If Google’s “heuristics” don’t start working correctly by the end of February, my net connection and hosting will have to go in order to eat.

    I thought Google’s motto was “Don’t be Evil”?

  66. Colin McDougall Said,

    February 7, 2008 @ 2:34 am

    I have used duplicate content and have not been harmed whatsoever.

    Quite frequently when posting a blog entry I find it useful to quote another person, so I will quote them and attribute them via a link to the original commentary.

    Plus I add my own unique thought and views on the post.

    Matt has been saying for what seems like a century now…

    If you are worried that much about duplicate content, you are very likely doing the wrong thing.

    Here is a really interesting thing I have personally found: the moment I stopped worrying about Google and focussed only on producing content for my visitor, my traffic began to soar.

    Matt, that’s the best advice I EVER received for publishing content!

    Dupe content folks, algo analyzers et al… Don’t bother looking to exploit issues with the Google algo, they fix broken stuff and they fix it fast.

    Rather than poke holes in an algorithm why not roll your sleeves up and get busy with some hard work, find your voice in your niche and go forth and be interesting.

    Might sound like rubbish to you hardcore SEO’ers but it’s some simple advice that has worked very well for me.

    Be real, be useful and you will do well

  67. Lever Said,

    February 8, 2008 @ 7:13 am

    Matt, that’s interesting, but what if an original site, a .com for instance, publishes an article and then you set up a UK operation and have a .co.uk site with the same dupe content that is totally relevant to the UK audience too but you want to keep the US/UK sites well apart? Posting a link to the original articles in that case might not be appropriate. How would that be dealt with?

  68. Mircea Said,

    February 8, 2008 @ 7:52 am

    I agree with Colin McDougall..

    Personally I would prefer it if G could be more efficient in detecting the original source of a text, but for now it seems that this is not happening, even with the link to the source.

  69. Jess Said,

    February 8, 2008 @ 9:28 am

    Matt,
    Question, the company I work for has a blog at blog.skylighter.com. We write articles now and then for an email newsletter, these go into an article section on our website, as well as appearing in our newsletter archive. Since we started the blog we also post them on our blog. Would this create a duplicate content problem since the blog is a subdomain?
    Thanks!
    -Jess-

  70. James Creare Said,

    February 8, 2008 @ 9:44 am

    I had a major problem with duplicate content on two of my web-sites. I had two web design sites that had similar but not exact copy. You could pick out sentences in both of the sites, but I was heavily penalised for this.

    The result was that each site seemed to alternate in the rankings, a bit like Clark Kent and Superman: you never saw them together. Then eventually they both disappeared from the rankings altogether. It took me a long time to work out what was going on.

    After a long while of researching and changing the sites, I came to realise it was the duplicate content issue, and I saw just how sensitive Google is at picking it up.

    Once this problem was fixed, it took a good few months for the sites to get back to where they were.

    Anybody else seen anything like this?

  71. Dave (original) Said,

    February 8, 2008 @ 6:46 pm

    Here is a really interesting thing I have personally found: the moment I stopped worrying about Google and focussed only on producing content for my visitor, my traffic began to soar.

    Prudent advice. The sooner Webmasters figure out “SEO/SEM” is a myth the better off the search World will be.

    BTW, doesn’t the “S” in “SEO” stand for Snake and the “O” for Oil :)

  72. Bluegill Said,

    February 9, 2008 @ 7:57 am

    Duplicate content is also a huge search spam issue. Here is a website whose only indexed pages in Google are all duplicate content. It uses numerous techniques to spam Google, such as illegally copying others’ content, and yet Google ranks it very highly for its respective keywords (number one in many searches):
    headlightso lution. net (just take out the spaces; I did that so I wouldn’t add to their already false, deceptive and inflated rankings).

  73. Dave (original) Said,

    February 9, 2008 @ 6:24 pm

    Bluegill, the site you mention has a TBPR of 2 and doesn’t rank for any of the main targeted terms. I have no idea if that is due to the reasons you suspect or others. Regardless, Google seems to have it right.

    Would I be correct in assuming the site is in competition with you?

  74. Rob Said,

    February 12, 2008 @ 3:29 pm

    I find this talk of duplicate content very funny. I have reported a case of duplicate content to Google three times, and after 6 months the 2 sites, owned by the same people and with exactly the same content, are number one and number four for a popular search term.

    This is the search URL http://www.google.com/search?hl=en&c2coff=1&safe=off&rls=en&q=brazil+property&btnG=Search and it doesn’t take long to figure out that google is being spammed in a big way. This probably explains why the said company ranks no. 1 for nearly every term related to this search.

    Rob.

  75. Rob Said,

    February 13, 2008 @ 3:56 am

    I wish I hadn’t written anything here at all now. As soon as I said something my site got thrown onto the second page :-( .

    Maybe blackhat seo does pay off after all.

  76. Jitesh Ghushe Said,

    February 13, 2008 @ 10:11 pm

    Hi

    We run a scripts directory where developers submit their web scripts developed in various languages. Many times, we find that they submit the same content as they have submitted to other directories. How should we deal with this type of content?

  77. Michael Brandon Said,

    February 20, 2008 @ 1:13 am

    Yes, I have commented on this before, but it’s happening again…

    I am getting increasingly annoyed at how Google is handling duplicate content - ie scraper websites taking either the meta description, or the snippet around the search phrase of a top ranked page. Google is treating the original as worthless content, and wiping it from its top SERP’s.

    When I change the text of my pages, rankings come back. But a client’s site had only come back for less than a week before it was again dropped from the SERP’s because the scrapers had found it and copied the new content.

    Another client has had its home page dropped from the SERP’s for a week now. And although Google has a newly reworded cached copy dated 15 Feb and now 17 Feb, Google has not yet integrated that content into its indexes properly, so the SERP’s have not yet returned. Matt, why is it taking so long for your indexes to be updated? You have commented before that Google has been superfresh, but it’s now been 5 days since a cache, and that cached information is still not in your indexes properly. Pathetic!!!! A few weeks ago it was max a day from cache to integration into indexes. And now I see that a scraper site has yet again taken a copy of the new meta description…

    I have other things to do and should not have to continually update the content of my clients’ sites and my own. The text was “perfect” the first time - usability, good snippets for many phrases… To have to continually rewrite is not appreciated.

    When is Google getting rid of this collateral damage/bug in its algorithm? Can you at least acknowledge its existence, as you have other bugs in your algos?

  78. Charles Weiss Said,

    February 26, 2008 @ 7:12 pm

    I’m not sure I follow this 100%. I’m currently working on a review site. One of the features I would like to offer is different versions of the reviews in different languages. Since my rating criteria are always the same, I do the write-up only once in English and then have the review translated into French by a professional translator. I would hope that having a structure like http://www.example.com/en/service_name.htm and http://www.example.com/fr/service_name.htm would not be penalized for duplicate content, while at the same time allowing for links in separate languages to the 2 language versions (and url’s) of the review. It would be easier and more cost effective for me to run 1 site with all the languages rather than separate domains and hosting accounts for each. Should I rethink my approach?

  79. Elise Bauer Said,

    March 1, 2008 @ 1:00 pm

    I have a PR6 blog that has been around for 5 years, I get around 50,000 visits a day as a result of Google searches. My content is all original.

    What I am finding is that some of my posts are being copied by others and posted to community-content sections of large established sites such as Yahoo or Epicurious. When the posts are copied, a link to the original is often not included.

    When this happens, even if my post has been around and indexed for years, it will be removed from the Google index and priority will be given to the copy on the other site. I am assuming this is because Google trusts Yahoo.com and Epicurious.com more than it trusts my site.

    When I find that one of my posts has been dropped from the index, I do a check for a snippet of text and I almost always find that the post has been copied to one of these big sites. My only recourse at this point is to re-write the post, change the sentence structure of each sentence. Within a day or two after doing this, my post is back in the index.

    This is extraordinarily time-consuming. And it is a problem that will only get worse as these user-generated community content sites get bigger.

    So, I would have to agree with other commenters that the Google approach could be greatly improved.

    One thing that would make it easier for me to manage is to be able to easily tell which posts from my blog are no longer in the index.

    The Google webmaster tools let me see the gross number of pages Google sees from my Sitemap compared to the gross number of pages from the Sitemap that Google is indexing. For example, Google sees 754 pages from the sitemap and 751 of those pages are in the Google index. What about the missing 3 pages?

    At the moment, the only way I can figure out which pages are having problems with being indexed by Google is to search for them manually, one by one.

    If Google can already see the number of pages in my Sitemap, and can see the number of pages that are indexed, can’t Google also provide the pages from the sitemap that are not showing up in the index?

    And if Google can’t do that, do you know of another service that I could use that could do this?

    Of course, if the pages wouldn’t go missing in the first place, that would be ideal. But assuming that it will take some time for Google to straighten out its method, a way to help us manage through the pages missing from the index would be very helpful.
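Until a report like that exists, the bookkeeping half of the problem can at least be scripted. The sketch below reads the URLs from a Sitemap file and subtracts a list of URLs already known to be indexed. It rests on a loud assumption: the `indexed_urls` list has to come from your own checks, since nothing in this thread describes a way to download it from Google. The function names and the sample recipe URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard Sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> set:
    """Return the set of <loc> URLs listed in a Sitemap document."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def missing_from_index(xml_text: str, indexed_urls) -> list:
    """URLs that are in the Sitemap but not in the caller-supplied indexed list."""
    return sorted(sitemap_urls(xml_text) - set(indexed_urls))


# Hypothetical sample data.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/recipes/a/</loc></url>
  <url><loc>http://www.example.com/recipes/b/</loc></url>
  <url><loc>http://www.example.com/recipes/c/</loc></url>
</urlset>"""

if __name__ == "__main__":
    indexed = ["http://www.example.com/recipes/a/", "http://www.example.com/recipes/c/"]
    print(missing_from_index(SAMPLE_SITEMAP, indexed))
```

This does not remove the manual checking described above; it only keeps track of which pages still need attention once the checks have been done.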

  80. John McKnight Said,

    March 3, 2008 @ 12:59 am

    What is Google’s stand on using parameters to create pages that basically regurgitate content on a site?

    I ask because I have seen quite a few sites lately that tack parameters onto a category page in a blog, for example, and then return different content if those parameters are present. In the cases that I have seen, they take each category link and create 20 or more links that each show the same posts, but in a slightly different order. The result seems to be the creation of several hundred pages of fake/duplicated content on a site.

    Obviously I consider this to be a bad thing, am I wrong?
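To see how many such parameter-only variants a site exposes, one option is to normalize each URL by dropping the parameters that merely reorder the listing and then group URLs that collapse to the same address. This is a sketch under assumptions: the parameter names `sort` and `order`, the `blog.example.com` URLs, and the helper names are hypothetical, and it says nothing about how Google itself treats these pages.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that only change the order of the same posts.
IGNORED_PARAMS = {"sort", "order"}


def normalize(url: str) -> str:
    """Drop ordering-only parameters and sort the rest for a stable form."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key not in IGNORED_PARAMS
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def group_duplicates(urls) -> dict:
    """Map each normalized URL to the variants that collapse onto it."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    # Only keep groups where more than one variant maps to the same page.
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


if __name__ == "__main__":
    crawl = [
        "http://blog.example.com/category/seo/",
        "http://blog.example.com/category/seo/?sort=date&order=asc",
        "http://blog.example.com/category/seo/?sort=title",
        "http://blog.example.com/category/seo/?page=2",
    ]
    for page, variants in group_duplicates(crawl).items():
        print(page, "has", len(variants), "variants")
```

Pagination (`page=2` here) is kept on purpose, since it selects different posts and is not just a reordering.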
