Sitemaps Interview

Sebastian has posted a good interview with the Sitemaps team. The most useful tidbit (which I didn’t know until now) is that Google treats a 404 HTTP status code (page not found, but it may reappear) and a 410 HTTP status code (page not found, and it’s gone forever) in the same way. I believe that we treat 404s as if they were 410s; once Googlebot has seen a 404 at that location, I think we assume that the document is gone forever. Given how many people use 404 instead of 410, that’s probably a good call for the time being.

Most of the interview is not about HTTP status codes though, I promise. The only thing I’d change (we’ll see if Sebastian reads this) is to make the questions a different color from the answers so it’s easier to browse. :)

82 Responses to Sitemaps Interview

  1. Aaron Pratt

    Yep, one of the more useful interviews indeed, Sebastian is also a great guy, just wish he would update his about page on his website. ;)

  2. Helpful info Matt, thx for the link …

    …. but …. about that damn algorithmic mystery 301 inspired downranking !?!

  3. Aaron Pratt

    Joe, are you talking about this:

    I have a site that has been #1 in google for 2 years for the product I make, I recently put a 301 in place and it dropped to #2. Removed it, back to #1, put it back in place, back to #2.

    ?

    Ok I will shut the hell up now, good night!

  4. Aaron it may be related but we have a “downranking to Google oblivion” problem that started Feb 2 LAST year and persists. (1.5 million monthly Google visits to 15k) It was actually discussed at length in the WebmasterRadio interview plus Matt’s site reviews in Vegas plus I’ve asked some of the best in the biz. It appears related to extensive 302 problems and accidental duplication of pages but remains unsolved after a massive site reconfiguration we did to consolidate several state level domains into our oldest one using 301s. I’m not seeing improved results with BigDaddy, in fact our 301 pages are back in the index after they were removed by me with removal tool!

    Whoops…OT? Matt this could be considered a sitemaps post right since we’ve got a HUGE one for this massive site.

  5. Magnus

    Irritating: Google has done some updates during the last week, and now the wrong information is in the index; nonexistent pages on my site are still in the index.
    I suppose it will correct itself, but as it was an update it is really strange. IMHO…

  6. Nigel

    Matt,
    How about you do one of these things:

    1. Permit hidden keywords on a page in the keywords meta tag. You can determine if those keywords are on-topic for that page by comparing the visible page text with the corpus of web text you have. So suppose a website puts “chocolate” in the meta keywords and ‘candy’ in the page, you can calculate the probability of that ‘chocolate’ word being related to ‘candy’; below a threshold and it’s spam, above a threshold and it’s the webmaster telling you what the page is about. You could even weight the relevance of the keywords tag relative to the page. So choc gets a 0.8 weighting and ‘viagra’ gets a 0.00001 because it’s off topic. That way sites can give you all the relevant text to a page without being forced to slap HTML text on it.

    2. Or you use a javascript enabled crawler and see what users see.
    That would also fix this problem:
    http://www.nigeljohnstone.com/archives/2006/01/rumbling_in_the_1.html
    You could follow any redirected links through CGIs, Perl scripts etc. to determine where the link actually points to.

    Instead of the we think this, nah nah nah nah we’re not listening nah nah nah comment closed approach. :)

  7. walkman

    “I believe that we treat 404s as if they were 410s; once Googlebot has seen a 404 at that location, I think we assume that the document is gone forever. Given how many people use 404 instead of 410, that’s probably a good call for the time being.”

    Matt,
    can you please double check on this? It seems like removed pages linger on the supplementals for years.

  8. Thank you Matt! I’ve changed the colors, why didn’t I think of that?

  9. Chris Purcell

    “404 Not Found: The server has not found anything matching the Request-URI. No indication is given of whether the condition is temporary or permanent. The 410 (Gone) status code SHOULD be used if the server knows, through some internally configurable mechanism, that an old resource is permanently unavailable and has no forwarding address. This [404] status code is commonly used when the server does not wish to reveal exactly why the request has been refused, or when no other response is applicable.”

    So it makes sense to treat a 404 as a 410, regardless of how commonly it’s done, because the HTTP specs allow 404 to replace 410 in any case.

  10. Matt,

    That sounds a bit aggressive on the 404s based on my experience. I’ve renamed stuff (and not bothered with redirects), but seen Googlebot come by again several times. Actually makes sense to me – seems like the spider should “check” a few more times before it makes that decision things are gone.

    And if there are inbound links to the (now) missing page, shouldn’t Googlebot come back around to see if it has re-appeared? I realize this is just the spidering activity and may not be reflected in the index, but you might want to clarify where the “assume it is gone forever” is applied.

    alek

  11. Aaron Pratt

    joe – thanks for clarifying, that sounds like a mess dude, good luck with it. i am now concerned with scraper sites that have pr and trust, they are showing duplicates of my articles in the engines days before mine even appear if they ever appear at all.

    muhaha @ sebastian. :)

  12. Nice interview.

    I’ve also found 404s persistently remaining in the index (for a while it was my fault – 404 header was incorrect), but I fixed it and they’re still there.
    Also 301s… I moved 120 pages 2 years ago, set up 301s and removed ALL links to those pages.
    18 months later I removed the 301s and BANG – 50% of the old urls re-appeared in the index… so I reinstated all the 301s and now (6 months later) the old urls are STILL showing up in site:mysite.com and repeatedly showing as errors in my sitemap stats!
    (Makes me wonder about sites like wayback – obsolete & incorrect urls still have live links on the web at places like that)!

  13. “Results 1 – 100 of about 2,240,000 for 404 “file not found”.” :D Keep saving those, Google, they might be good in a few years!

    Please let us know, Matt — how can we tell Google that we really mean 404 (like go away) when we say 404? There are so many people with this problem – old pages never dropping out of the index, junk accumulating, Google’s search results filled with pages that just don’t work (for a long time now). Is there a trick to saying 404 in a way that Google will accept as a final answer?

    With Sitemaps the webmasters now have a chance to look under the cover of Google’s search engine – and they see those everlasting 404s: old links or whatever that point to some URL that has been long gone; Google shows it as an “error” and the webmasters have no way of correcting it, or even of finding the link that pointed to that URL (if that link even still exists)…

    What can we do to make it easier for you to index our real content and forget about our past-life?

  14. Sebastian, thanks for changing the colors–much easier to browse! :)

    walkman, a supplemental result has to be recrawled for that 404 to go into effect. Since supplemental results go longer between recrawls, that’s the reason why some 404 pages linger. See below though; supplemental results need to be recrawled by the supplemental results Googlebot before they are processed.

    alek, you make a good point. We might try to recrawl a site several times if the page never loads, for example. I’m not sure if we do that with a 404 though.

    JohnMu, one thing to keep in mind is that we look for an HTTP status code of 404. For example, for the search [404 "file not found"], the #1 result isn’t a 404; it’s a gag site. See how it says “Please, for the love of god, try the following: … Go outside now” and it’s clickable to go back to the main site.

    A lot of the time, a site will return a 404-looking page (not a gag like the site above), but the HTTP status code will be 200 (as if the page was found just fine). We call that a crypto-404. We may look for certain phrases like “file not found,” but if you want to be safe/conservative, I’d doublecheck that your 404 pages really return an HTTP status code of 404, not 200. After we recrawl the page with the appropriate spider (e.g. for a supplemental result page, it needs to be crawled by the supplemental result spider), the 404 or 301 or whatever will be processed.
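
    To make the crypto-404 idea above concrete, here is a minimal sketch in Python. It is not how Googlebot works; the function name, the labels, and the phrase list are all hypothetical and chosen only for illustration. It takes a status code and page body you have already fetched and labels the response.

```python
# Hypothetical helper: the phrase list and the labels are illustrative
# assumptions, not anything Google has documented.
NOT_FOUND_PHRASES = ("file not found", "page not found", "404 not found")


def classify_response(status_code: int, body: str) -> str:
    """Label an already-fetched response by status code and body text."""
    if status_code in (404, 410):
        # A real "not found" or "gone" answer at the HTTP level.
        return "gone"
    if status_code == 200:
        text = body.lower()
        if any(phrase in text for phrase in NOT_FOUND_PHRASES):
            # Looks like an error page but claims success: a crypto-404.
            return "crypto-404"
        return "ok"
    # Redirects, server errors, and so on are out of scope for this sketch.
    return "other"


if __name__ == "__main__":
    print(classify_response(200, "<h1>File Not Found</h1>"))
```

    A phrase match is only a heuristic (the gag site mentioned above would trip it), so the status code remains the thing to check first.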

  15. Steven Jeffery

    Matt (or anyone else),

    I run a site that has (according to G’s seemingly inaccurate count) 580k indexed pages that is merging with another site that has 81k pages indexed. The two sites will be merging into a new third site. I see two ways of going about this that *might* work.

    1) As soon as the new site is up (the code and content is based largely off the new site), everything will be 301 directed that can be (probably all of the pages from the larger site and a small number from the smaller site). The concern is that it will take many months for G to index most of the pages on the new site and we will therefore lose most of our traffic for that period.

    2) Create the third site and wait until it is indexed fully and then do the redirects (per above). The problem I see with this is that we could be hit hard with the duplicate content penalty as there will be the exact same content on at least two sites.

    Thoughts?

  16. To extend Matt’s point:

    About a year after I first learned how to create custom 404s in ASP, I discovered that by default they return a 200 Status code. If you don’t code in the Response.Status = "404 File Not Found" in ASP, you’ll see a lot of those custom 404s in the index.

    I’m sure there are 404s that don’t meet this description, but at least some of them do, and Google can’t be faulted for that. This should provide at least a partial explanation for the behaviour.
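
    The same point can be sketched outside ASP. The following is a minimal, assumed example using Python's WSGI calling convention: a custom error page that sends a real 404 status line along with its friendly HTML. The names `not_found_app` and `run_app` are hypothetical, and `run_app` exists only to show which status the page reports.

```python
def not_found_app(environ, start_response):
    """Custom error page that still reports a true 404 to crawlers."""
    body = b"<html><body><h1>Sorry, that page is not here.</h1></body></html>"
    # The status line is what a crawler reads; the HTML is only for people.
    start_response("404 Not Found", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def run_app(app, path):
    """Call a WSGI app directly and return (status, headers, body)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app({"PATH_INFO": path, "REQUEST_METHOD": "GET"},
                        start_response))
    return captured["status"], captured["headers"], body


if __name__ == "__main__":
    status, _, _ = run_app(not_found_app, "/missing-page")
    print(status)
```

    If the status string were "200 OK" instead, the page would look identical in a browser but would be the kind of custom 404 that ends up in the index.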

  17. Ben

    I use my 404 pages to randomly generate a never-ending chain of pages that randomly construct useless content from an array of over 800,000 sentences, into a random number of paragraphs for an end result of 300 – 600 words per page.

    I call it blogger.com ;)

  18. James

    How to have an AJAX site indexed ?

    We’re doing another site (only for search engines) that has _exactly_ the same text/image than the AJAX site. The only difference is that you find links to pages which you do not in the AJAX site and that in the plain html site you got a javascript that redirects the user to the AJAX site.

    I would like to have some directions on Google’s part on how to have a AJAX site properly indexed and not delisted given that I am not trying to boost my page rank. I just want to be (fairly) indexed.

    Sorry to post this here but I could not comment this on the other thread.

  19. ben, good idea to have more pages full of content :D

  20. Philip

    I’m not convinced that Google does understand 404s properly. I tried removing a url with the Google removal tool about 4 months ago and got a message back saying request denied for page removal. When I emailed Google about this they replied:

    “Thank you for your note. We’re sorry about any confusion you’ve experienced. We’d like to reassure you that because this page currently returns a true 404 error, we can remove it from our search results. This removal should be processed within three to five days. We appreciate your patience.”

    but to this day the page is still in Google’s index.

  21. Paul

    Looking around the forums, there are a considerable number of comments stating ‘Since integrating sitemaps into our website, we have lost all of our positions’ and ‘our site has disappeared’.

    This makes one hesitant in creating a sitemap, for fear of losing either positioning or the site altogether, as far as the SERP’s go. On the other hand, if one doesn’t create a sitemap, you feel left behind, and not keeping up with the available technology.

    It’s a catch 22 situation, whereby you could win, or you could lose, according to many webmasters in the forums. Any comments would be appreciated.

  22. Sebastian

    Hello Matt

    Maybe you can pass the following idea to the sitemaps team:

    In an older post you mentioned Google tries to contact webmasters when they violate the webmaster guidelines while they still have good content, where just someone in their website team did bad stuff.
    Can’t this be combined into Sitemaps? Just like the crawling errors?
    Sitemaps could be turned into THE communication channel from Google TO webmasters, since you have already verified (by the empty file thing) that
    the person with the sitemap account is responsible for that website.

    I have been severely kicked by last autumn’s update and I would kill for some feedback from Google on why my pages now rank behind notorious page spammers, ppl with keyword.domain.tld urls and sites with nothing on them except links to affiliate marketing networks.

    regards
    Sebastian

  23. Hi Matt

    I know about “404s” that return 200 – since it’s been one of the pushes that Sitemaps “forces” the webmaster to clean up, I’ve had my share of users asking about that (even set up my own explanation http://gsitecrawler.com/articles/error-404-200.asp and server test http://gsitecrawler.com/tools/Server-Status.aspx ;-) ). However, even pages that return “404” as a return code are often kept “forever” in Google’s index.

    It bothers the webmaster a tiny bit, but it really bothers the user who expects to find working links in a search engine like Google. Finally finding what you’re looking for, only to see it 404 in the browser is … depressing …

    What can the webmaster do to speed up Google’s dead-page removal (without manually submitting each and every URL to the page-removal tool)? Or how about Google offering to go to the last cached page when it last saw that the URL 404s? I’m sure there must be a way, with so many Google experts at work :-).

    John

  24. I thought this query interesting :-) — perhaps you see what I’m getting at?

    http://www.google.com/search?num=100&c2coff=1&q=site%3Agoogle.com+404+-groups+-directory+-answers+-maps+-sitemaps+-remove+-support

    I know that domain isn’t really *that* important, but if Google really indexes 500+ URLs from there that are certainly returning 404 as a result code, there must be something slightly off-balance? Or is there reason to a listing like that?

    John

  25. Oups, there is another Sebastian out there … if you post suggestions in the sitemaps group a member of the Sitemaps team will read it:
    http://groups.google.com/group/google-sitemaps

    Paul, I’ve investigated lots of these “sitemaps has tanked my site” reports and they all have one thing in common: no proof, no evidence, 100% speculation, always other causes which have nothing to do with Sitemaps. Don’t blame the sitemap when your site gets tanked. If a sitemap submits junk to Google, the standard procedure will handle it, in the same way as with add-URL-page submissions and links found elsewhere. OTOH creating an XML sitemap is a chance to look at a site’s structure and its contents. Grab any tool to generate a sitemap, double-checking the collected URLs before you submit it often leads to unexpected insights you can use to enhance your architecture.

    Sebastian

  26. Reik

    I checked the headers of our custom 404 page returned from Tomcat. It’s returning a 404, puh :) Won’t these crypto-404s result in a lot of content duplicates? We have plenty of 404 URLs in the index. They won’t have any title or description, but they are still listed. Is there a rough rule of thumb for how long these invalid URLs are gonna stay in the index?

  27. Aaron Pratt

    I am not getting a response from Matt, so maybe someone can explain: it appears that aggregated scraper “news” sites get credit for blogs without PR like mine. I would like an answer on this Matt, I really do not ask for much seeing that I am being used by the engines and scrapers at the same time. Your content is also appearing on the same “new feed” if that is what it is called. I call it lame. Is my content all now considered duplicate?

  28. Here’s one I’ve never seen before: meta tag spam. And apparently it’s working.

    This site has meta tags i’ve never even heard of..

    sagastume.com

  29. HI Matt,
    Great information.
    I reported a violation to G a couple of weeks ago about the NY Times violating Googles adsense policy, by opening ads in a new window.

    After checking again today, it seems they haven’t changed and are still having ads open in a new window.

    Is this not a violation?

  30. Dave

    “but if Google really indexes 500+ URLs from there…”

    There are only 220 pages there with supplemental results shown.

  31. Ryan, that’s yesteryear’s spam :-). Those tags are all valid (look up “Dublin Core Meta tags”) – but I doubt they’re doing them much good. If they’re ranking for any of that, then it will be because of other factors. (see seomoz’s excellent http://www.seomoz.org/articles/search-ranking-factors.php for more ideas).

    John

  32. Back in September I set up a custom 404 error page that would redirect users to the home page if a page had been deleted. All the talk about using 301s instead of 302s made me use a 301.

    Apparently Google had a lot of old urls in their index even though these pages were probably returning a 404 for a long time. I guess Google tried to crawl them once again and got a 301 back to the homepage.

    The jagger update dropped my site from the 1st page to the 10th page. I speculate I have a duplicate content penalty for using a 301 instead of a 404 that still exists almost 4 months later:( I read a thread somewhere about someone doing the same thing and being penalized, which is what made me realize I did something wrong and change it to respond with a 404 status.

    My site also has many more pages indexed with the non-www version of my site even though I am using a 301 to the www version. My www version starts getting more pages indexed and then it drops back to 1 page, my homepage. I have seen it do this several times.

    Google needs to do a better job of updating its index when it sees a 404 or a 301. It seems they are spending too much time on showing a smaller url when someone does a 302. MSN doesn’t have results as good, but it updates its index very frequently.

  33. Dave, perhaps “supplemental result” means “I shouldn’t show this, but…” or something else that I’m not aware of :-). I’m still getting 594 results for that query. Try these queries on G, Y and M – which engine seems to handle 404s correctly?

    http://www.google.com/search?q=site:www.google.com+%22404+not+found%22+%22not+found+on+this+server%22&num=100&hl=en&lr=&c2coff=1&filter=0
    http://search.yahoo.com/search?p=site%3Awww.google.com+%22404+not+found%22+%22not+found+on+this+server%22&prssweb=Search&ei=UTF-8&x=wrt
    http://search.msn.com/results.aspx?q=site%3Awww.google.com+%22404+not+found%22+%22not+found+on+this+server%22&FORM=QBNO

    I just don’t see the reasoning in keeping results like that online. And I imagine Google’s results will be the first to have manual tweaks – so I don’t want to know how long other people’s 404s stay live …

    John

  34. loki

    When will Google REALLY “crack down” on cloaking? They let it exist KNOWINGLY in their index for all to see. Don’t believe me… check it out.

    Look at ANY of the medscape URLs on this crawl: crawl results

    NOW bookmark any of the medscape links, or cut and paste into a new browser… guess what, you will no longer see that content… you will be required to first create a free account.

    Medscape is clearly tricking Google. Up until a few weeks ago, their cache was different from what you actually saw. But is this fair? Here they are clearly looking at referrer and if from Google they deliver the content, but when you try to reaccess, well, you have to login.

  35. loki

    The point of this is that “cloaking” is in fact useful, and widespread in Google. Google’s spiders are machines — as such, Webmasters increasingly build content JUST for them… the result are pages that don’t read well for users. Similarly, as in the Medscape example, sites need a way to let users know what lies behind their members area… such uses of cloaking are quite legit, as is the BMW case.

  36. Chris Mills

    Matt,

    How can I identify the supplemental results Googlebot (mentioned above) in my logs? Is there a UA signature specific to this, or maybe some range of IPs?

    I see multiple Googlebot UA signatures in my logs, confirmed as coming from Google IPs. Even if one isn’t specific to supplemental results, I’d love to know what the differences between them are.

    Thanks.

  37. Aaron I’ve heard of that problem and was under the impression that when the bot sees the SAME content it has trouble determining the originating site. I’d guess it uses PR to make the call and since you are lower than the quoting sites your site is assumed to be the duplicator. But I’m not sure of this at all.

  38. Vladimir Nedovic

    Perhaps this is a violation as well, but it’s my way of getting around things. In other words, I need to comment on the removal of BMW.de from Google’s index, and even if this post is removed, someone will have to read it before.
    Looks like Google people are becoming too arrogant, behaving like Web police (don’t get me wrong, I believe in Google more than any other tech company, but some of the last steps are quite surprising). I mean, if someone has a way of boosting their score, why would you not just try to be smarter than them and lower it when a trick is encountered (which is what Google’s unofficial policy has been so far, right?) – wouldn’t that be more fair than setting the score to 0? To put it bluntly, how dare you punish anyone?
    Furthermore, and this is even more surprising, Matt said that they might be asking for information on WHO created the redirects at BMW.de. And what about all the privacy stuff that Google is so concerned about? Doesn’t that apply in their case? If I understood this well, this is quite shocking to me.
    Cheers everyone, and apologies for intruding your topic.

  39. Here’s one for the Matt Cutt(ing)s Scrap Book – your BMW story got you in to Metro here in the UK, a newspaper that everyone on the Underground gets free…

    http://www.ourfirstflat.com/matts-cutting.jpg

    Henry

  40. I saw that on the tube this morning reading the Metro – Matt’s just got famous in London!

    And they said he let this out unwittingly… Erm I think not.

  41. Are there any hints you can give on how many times Google will check a 404 to see if the document has returned…. Many times if a site has a 404 it is an oversight or problem that is being dealt with… Other times it’s from a client not paying hosting and it takes them a few days to realize their site is down.

    I have often wondered how much time you have with a 404 to be corrected… Like many others I’m sure, if I realize my site is 404ing on a page, I’m panicking to get it fixed as soon as possible…. Would be comforting to know just how many times it takes till the page is removed from the index because of the 404….

  42. Harith

    Hi Matt

    Just have few questions :-)

    Do you find the spam reporting of the friends at WMW of benefit to the work of Google WebSpam Team?

    Have you acted on a spam report from a webmaster concerning BMW doorway pages or was it just the work of your team?

    Have a great day.

    PS. any weather report about Frank’s sister :-)

  43. Hey guys, Great blog!

    I run a web hosting company for around a year or so, but it seems i have failed to achieve any good rankings for hosting related keywords which i know are very hard to achieve.

    Even though my main page has a PR5 and the rest are PR6 , google has failed to rank the site for even any slightly better keywords related to my type of hosting that i offer.

    Am i doing something wrong? or do i just need to keep on building upon my linking campaigns and improve the number of backlinks to my website.

    By the way, i read the story about BMW’S ban in the newspaper today :)

  44. John, I just checked the first couple of URLs on that Google query I posted – some of the 404s are cached as the 404 page since last June :-).

    I second Harith’s questions (except for the Frank’s sister part). How does a spam report go into Google’s web-spam workflow? Does the number of reports for a site matter? Or can a single report cause a manual check?

  45. Remember when Google’s IPO was about to come through and they dumped a bunch of garbage in their index, so they could boast they had the largest index???

    Sounds like maybe that’s why they leave a lot of these pages.

    I have not seen this anywhere but I just thought of it.

    Web 1.0 > Web 2.0

    Large Index > Quality Index
    Lots of Backlinks > Themed BackLinks
    No Response from Google > mattcutts.com/blog :)

    If you haven’t heard of web 2.0 check out:
    http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html

  46. Hmm, what about domains that have expired or have ‘gone down’?
    The search results still show the old cache version, even though the domain’s down.

  47. I kinda had a feeling that both error codes meant the same, but am now reassured. Thanks! :)

  48. Hey Matt, would you do a fun interview with me/my site ? It’s for librarians………. you did say librarians rock… lol. :P

    madferit [at] gmail.com

  49. Hi Matt

    We use a 302 to redirect nuwear.com to http://www.nuwear.com and we also use it to redirect the ip address to http://www.nuwear.com

    Is this bad as many people suggest it could be harmful to do it that way.
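
    For readers following the 301 versus 302 question, here is a minimal sketch of a canonical-host redirect in Python. HTTP defines 301 as "Moved Permanently" and 302 as a temporary redirect, so a permanent move to the www hostname is normally expressed with 301. The hostname `www.example.com` and the function name are placeholders assumed for illustration, and nothing here says how Google scores either code.

```python
from typing import Optional, Tuple

# Assumption: placeholder hostname; substitute the site's preferred host.
CANONICAL_HOST = "www.example.com"


def canonical_redirect(host: str, path: str) -> Optional[Tuple[str, str]]:
    """Return (status line, Location) if a redirect is needed, else None."""
    # Drop an optional ":port" suffix and ignore case when comparing.
    # (IPv6 literals are not handled in this sketch.)
    bare_host = host.split(":", 1)[0].lower()
    if bare_host == CANONICAL_HOST:
        return None
    # Any other host name or a raw IP address is sent to the canonical
    # host with a permanent redirect, keeping the requested path.
    return ("301 Moved Permanently", f"http://{CANONICAL_HOST}{path}")


if __name__ == "__main__":
    print(canonical_redirect("example.com", "/products"))
    print(canonical_redirect("www.example.com", "/products"))
```

    Keeping the path in the Location header means each old URL maps to its own new URL rather than everything collapsing onto the home page.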

  50. Julie, your site is devious. I like how you mix the google ads in so they look like they’re part of your menu.

    It actually tricked me the first time I was there. You almost got a click out of me. Well played.

    I always wondered about that though… wouldn’t ads like that just make your visitors leave your site before they actually see any of your content? Not entirely useful if you ask me.

    I’m more inclined to place ads AFTER content.. that way, it gives them something to do after they’re done at my site.

    Anyway.. good luck with it.

  51. Mistah

    It would be great if Sitemaps did evolve into a way for webmasters to communicate with Google as Sebastian suggested.

    One area that springs to mind is where several domain names point to one site e.g. mysite.com and mysite.co.uk. A webmaster could use Sitemaps to tell Google which domain should be listed and which domains are just there to prevent domain squatters.

  52. Ryan –

    Well, it wasn’t devious intentionally. I was just trying to find a good spot to put them that wouldn’t interfere with the actual content. I don’t want to drive people away with a big fat old ad in the middle of a story. I personally hate that. I want to make a little $$ off of it, but not so much that I’m going to compromise the design. I tried ads at the bottom before, but they were never touched. The people keep coming back and telling their friends, and I’m getting lots of new members each day…. I got like 40 new members yesterday alone. I’ve got two volunteers translating the newsletters (and now working on the site) into Chinese, which is pretty cool. Now I just have to do some more updates to the site…and get working on our next monthly newsletter! This thing took off faster than I could have ever imagined. And I was just trying to come up with a better alternative to the “official” Google Librarian stuff, which if you ask me, is pretty sparse and not quite on target for librarians.

    I am going to assume that my target demographic is a bit smarter than average (due to the fact that most have MLS degrees and higher, and many members are upper-level management). I find that the Google Search generates more clicks on ads than the content ads. Librarians are curious and inquisitive, so they like to search. They can’t resist the search box. lol. Sounds funny, but it’s true.

  53. Going even more off topic :-) – Julie, you can improve your ads relevancy with small comment markers for the Adsense-Bot, see http://seside.net/google/2005/10/25/improving-googleads-relevancy ; I saw that the tutorials page for example just listed tutorials for languages (probably catching the “view site in ….” tags on the side). As always, better relevancy of the ads = better ad targeting = better visitor experience = more clicks = more $ = happy Google + happy webmaster :-)

    John

  54. Ralf

    Matt, I just noticed the new robots sitemap tab, nicely done

    @Julie take a look at your copyright metatag – missing quotation mark

  55. Julie

    Thanks for the info guys. I kinda just threw the site together late Nov., so I may have missed a few details. Thanks for letting me know!

    Thanks for the adsense tip. I mostly do my sites for personal enjoyment/enrichment, not money, but hey, if I can make a few bucks doing something I like its all the better. I’ll give it a try :)

  56. Aaron Pratt

    Thanks Joe, that is exactly the way I see it and what this means is everything we do is stolen by someone else.

    It is obvious that Matt has more international things going on and we will most likely be ignored once again.

    My suggestion is if you have a blog start talking about these issues, maybe just maybe G and M will listen and answer our questions?

    HERE is the example of someone earning ADSENSE dollars by duplicating my posts in my blog.

    Good

    night! :(

  57. Interesting and certainly frustrating for you Aaron. Funny cuz I was also thinking it would be fun AND educational to start a blog/forum for people who are experiencing what appear to be technical difficulties with Google that hurt their legitimate rankings. By collecting complete contact info for legit sites one might help the site AND Google separate legit from spammy concerns.
    But we drift OT here so I’ll end with a sitemap rap.
    “da dot com be crappin,
    so i go sitemappin,
    but da G says
    “iz juz not not gonna happen!”

  58. Aaron Pratt

    newscraper trash
    spammers got cash
    what about us
    we went under your bus
    :(

  59. Aaron, I’ll do you one worse.

    About 3 months ago somebody took every article that I had ever submitted to article syndicates (like go articles), changed the name and bio to information about him, and re-submitted them to the same article syndicates…

    And now he kills me in the search results for the title of an article that I wrote.

  60. Mistah – your point about several domains linking to one server is a good one and one I am struggling to use sitemaps to deal with at present. My .com site is well indexed and ranks well for several important keywords but I have recently purchased the .co.uk to try and get my .com show up in .co.uk only results too. I have the .com and .co.uk on the same server (parked?) and am using absolute .co.uk links within the sitemap for the first few .co.uk pages so Google should see .co.uk links and then the rest of my site uses absolute .com links to hopefully make G realise .com is a Uk based site – is there a better / recommended way to do this in sitemaps / google?

    Sitemaps are really useful but this type of thing and a form of communication to the big G would be even better :-)

  61. Aaron Pratt

    Ryan – Yes indeed.

    I do not blame Google, I blame all the folks who are content spammers, it is not about feeding the engines crap, it is about presenting your passion to the world. If you are into “beach erosion” blog about it, just keep an eye on Mr. SEO News, he still is having his way with your content I believe.

    Joe – Wurd!

  62. I shouldn’t complain. I have 2 sites that do nothing but take headlines from various other sites and aggregate them.. However, I just display headlines, and put an actual real link to the webpage my crawler found it on.. I feel that’s fair.. AND a great way to get content for some of my sites.

    It’s the people who take the article word for word, or remove the links from it that piss me off.

    I have experienced the same with one of my articles though Aaron.. I posted it first on my blog, then syndicated it a month later.. Do a Google search for the article title however, and my blog is nowhere near the first page, but the syndicates without links are #1.

    I would say Google needs to do something about “where was it first”, but given the nature of its spider, it won’t really know… and if they checked dates on the page, I’m sure people would easily start faking those.

    With syndicated articles being all the rage, there really is no good way for google (or anybody for that matter) to tell where it originated… I almost suggested some sort of pinging system, but even that could be spoofed / faked.

    I’m out of ideas… now this is bugging me.

  63. Hi Matt:

    SiteMaps continues to report several 404 errors on pages that never existed under this domain, but did exist under a previously used domain that we 301′ed to this new domain over 2 years ago. Could these 301 redirects prevent GoogleBot from recognizing the 404 response on the current domain?

    Bobby

  64. Ryan in your case I’d think you could file a DMCA complaint. Copyright rests with the author even if you don’t file for it. Often just the threat of this will get the content removed or, sometimes better, attributed to you and your site.

  65. Aaron Pratt

    Joe – Yes indeed, I am going after people I feel have done me wrong, and in just one day I see their sites are either down or the content has been removed. Do not be afraid to stand up for yourself. Just try to refrain from threatening people’s wives and children, but anything short of that is fine, hehe ;)

  66. Lee

    Ryan, if you found an article on Site X, why would it matter? Don’t you think the author deserves a link to his/her site?

    By linking to the article, you’re not helping your fellow online marketer.

    As for the DMCA, does that apply to thieves who live in other countries? How can we tell Google that someone stole our article and infringed upon our copyrights? I think Google, not as a policing official, but as a concerned citizen, should take complaints like this seriously.

    Aaron Pratt is correct in his assertion that PR shouldn’t affect who Google thinks is the originating resource. The true originator should get the credits deserved regardless of how old a site might be or how many links point to that site.

  67. Lee, the problem is that PR rules crawling, so high-PR aggregators get their stuff crawled more frequently than low-PR content sites. In Aaron’s case the aggregator got the “source bonus” simply because his page was fetched before Aaron’s.

  68. Matt,

    Turns out that a network admin, while trying to create a “temporary” ftp access credential for a site accidentally took it offline (yeah, i know) for about 42 minutes. If the 404 error & the 410 error are the same, and if Googlebot came around during those 42 minutes, then how bad could it be??

    I’ve noticed that the site has dropped significantly in the rankings. Is this just a coincidence? Or is there something that I can do to rectify the misunderstanding??

    Thanks!

  69. Mel

    Anyone here have access to their website error pages?

    You know where you can replace the standard 404 error page with WHATEVER page you want?

    If so, will Google view an error page that has content with AdSense on it any differently?
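
On the mechanics of replacing the standard 404 page: the HTML shown to visitors and the HTTP status code sent to crawlers are set independently, so a custom page can still answer with a real 404. Below is a minimal sketch using Python's standard `http.server` module. The names `KNOWN_PAGES`, `CUSTOM_404_BODY`, `build_response` and `Handler` are made up for illustration, and it says nothing about how Google evaluates the content of such a page.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical set of pages this toy site actually serves.
KNOWN_PAGES = {"/": "<h1>Home</h1>"}

# Custom "not found" page; the HTML can be whatever you want.
CUSTOM_404_BODY = (
    "<h1>Sorry, that page was not found</h1>"
    "<p>Try the <a href='/'>home page</a>.</p>"
)


def build_response(path):
    """Return (status_code, html_body) for a request path."""
    if path in KNOWN_PAGES:
        return 200, KNOWN_PAGES[path]
    # The body is custom, but the status code is still 404.
    return 404, CUSTOM_404_BODY


class Handler(BaseHTTPRequestHandler):
    """Serves known pages with 200 and everything else with a custom 404."""

    def do_GET(self):
        status, body = build_response(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To try it locally, pass `Handler` to `http.server.HTTPServer(("127.0.0.1", 8000), Handler)` and call `serve_forever()`; requesting any unknown path should show the custom page while the response status remains 404.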

  70. Matt and the Sitemaps Team,

    I’ve got a real head scratcher for you, but first I want to thank Google for creating the new Sitemaps interface, for without it this problem would never have come to my attention.

    In the Sitemaps interface I see what Google shows as our web pages’ most common words. This is a tremendous help. The number one word is “lasik” and this makes sense as we certify Lasik doctors and talk a lot about Lasik surgery. The number three word is “surgery” and a count of its occurrence on our web pages indicates this is about right. The word found second is “Viagra”.

    Say what?

    In natural language the word “Viagra” is found a total of two (2) times in our entire website. (People who use Viagra want to know if it is contraindicated for Lasik and we have an article about it.) If I add all of the source code (viagra.gif, lasik-viagra.htm, menuy, sitemap, navigation), “Viagra” comes up a total of 9 times in 6 pages. We have about 640 pages total, not counting image files, etc. Why would “Viagra” be number two when it should not even be in the top 20?

    If that isn’t odd enough for you, the Sitemaps interface shows that “phentermine” is the sixth most common word and “xanax” is seventh. These words are NOT on our website anywhere. Not in natural language, and not in code.

    How can Google believe that three of the seven most common words in our website are two words that do not exist and one that appears about 0.0001% of the time?

    We use Google as our local search engine. I search for “phentermine” and “xanax” on our website and neither of them come up. Google thinks they are very common, but Google can’t find one instance of their use.

    We have no black hat here. We have nothing to do with these products and I’ve already told you the instance where “Viagra” is used.

    What is worse is that these three terms are moving up. They were lower down the scale of common words just a few days ago. This week Google thinks we have a higher instance of words we don’t even have in our website. I think this may explain why we have dropped from a high average of 26 (bumped 16 one day) in “lasik” search down to 47 depending upon the dance. We say we are about Lasik, but Google thinks we are about common “bad neighborhood” terms.

    How do I get this resolved and what could possibly have caused this? It makes no sense to me at all.

    Feel free to stop by our website and email me:

    http://www.USAEyes.org
    glenn dot hagele at usaeyes dot org
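
The kind of count described in comment 70 (how often a term appears across a site's pages) can be reproduced independently of the Sitemaps interface. Below is a minimal sketch, assuming the text of each page has already been extracted into a string. The function name `term_counts` and the sample pages are invented for illustration and are not taken from the site in question.

```python
import re
from collections import Counter


def term_counts(pages):
    """Count lowercase word occurrences across a dict of {url: text}."""
    counts = Counter()
    for text in pages.values():
        # Split on anything that is not a letter or digit.
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts


# Hypothetical sample data standing in for a site's extracted page text.
sample_pages = {
    "/lasik.htm": "Lasik surgery explained. Lasik doctors certified.",
    "/lasik-viagra.htm": "Is Viagra contraindicated for Lasik surgery?",
}

counts = term_counts(sample_pages)
print(counts.most_common(3))
print(counts["phentermine"])  # a Counter reports 0 for a word it never saw
```

A local count like this only covers text you feed it, so a mismatch with a search engine's report would suggest the engine is seeing text from somewhere else, not that either count is wrong.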

  71. mac

    When is Matt Cutts’s interview on television?

  72. It is a pity that Matt missed your question Glenn. I am also intrigued by how Google sees local rank.

  73. Thank you Matt, you are my favorite resource for news, tips and lessons, for young webmasters all over the world.

    Great Thanks

  74. Ricky

    Hello Matt,
    I have a query. Recently I have noticed something very unusual in my Sitemaps account. In Sitemaps, one field is for the web crawler, and part of that field is Not Found pages, which shows error messages when the crawler does not find a page or the page URL is misspelled. But I have noticed that these 404 errors are for pages that are not on our website: they come from some article websites where my URLs were typed with spelling mistakes. Maybe the referrer made a mistake in typing them. Does this affect my ranking? It gradually increases my Not Found errors for pages that are not part of my website domain but are linked from other websites. How can I resolve the problem? Is there any method, or is it a weakness of the Google Sitemaps tool? The number of error messages in my Sitemaps results increases day by day.
    Please give me a solution.

  75. Hunh, interesting. I didn’t even KNOW about the 410 code before reading this! Thanks for the heads up.

  76. I didn’t know about the 410 code before either. I googled it and found this info:

    http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40217

    Thanks Matt!

  77. jen

    Does Google still treat these the same? Is there any advantage to using one over the other with regard to SEO on a site?
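
For anyone who has never seen a 410: the difference between the two codes is only in what the server claims. 404 means "not found", while 410 means "gone" and is meant for pages that were removed on purpose. Below is a minimal sketch of how a site might choose between them, assuming it keeps a list of deliberately removed URLs. The names `LIVE_PATHS`, `REMOVED_PATHS` and `status_for` are hypothetical, and the sketch does not show how any search engine reacts to either code.

```python
from http import HTTPStatus

# Hypothetical lists; a real site would load these from its CMS or config.
LIVE_PATHS = {"/", "/about"}
REMOVED_PATHS = {"/old-promo"}  # pages taken down on purpose


def status_for(path):
    """Pick the HTTP status a server would send for a request path."""
    if path in LIVE_PATHS:
        return HTTPStatus.OK          # 200
    if path in REMOVED_PATHS:
        return HTTPStatus.GONE        # 410: removed deliberately
    return HTTPStatus.NOT_FOUND       # 404: never existed, or unknown


for p in ("/about", "/old-promo", "/typo-url"):
    print(p, int(status_for(p)), status_for(p).phrase)
```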

  78. I didn’t know about 410 either…

  79. JS

    I’ve never seen a 410 code.
