Q: Why doesn’t my site show in SafeSearch?, or “Do you hate Metallica?”

We recently received a question about why www.metallica.com doesn’t show up in SafeSearch (e.g. note the results for [metallica] and then compare with adding the parameter “&safe=on”).

The answer is in their robots.txt. http://www.metallica.com/robots.txt has

User-agent: *
Disallow: /

This means that we have crawled zero pages from www.metallica.com. If you look carefully, we show a description of www.metallica.com from the Open Directory Project, not any snippet from their page. SafeSearch can’t return uncrawled/empty documents (unless they have been whitelisted), because the documents might turn out to be unsafe. Hopefully it makes sense that SafeSearch shouldn’t return a document to a user if we don’t know what that document actually has in it–what if the document had porn on it? So while we could whitelist metallica.com, the correct answer is for their webmaster to allow us to crawl their site.
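To see what that robots.txt does in practice, here is a minimal sketch using Python’s standard urllib.robotparser module. It is only an illustration, not how Googlebot itself works; the “ExampleBot” user-agent name and the permissive file are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The two lines quoted in the post: every crawler is barred from every path.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# Hypothetical permissive file: an empty Disallow value blocks nothing.
ALLOW_ALL = "User-agent: *\nDisallow:\n"

HOME = "http://www.metallica.com/"

for agent in ("Googlebot", "ExampleBot"):
    # Prints the agent name, then whether each file lets it fetch HOME.
    print(agent, allowed(BLOCK_ALL, agent, HOME), allowed(ALLOW_ALL, agent, HOME))
```

With the first file, no crawler that obeys robots.txt may fetch any page, so there is no crawled content for SafeSearch to evaluate. With the second, every page is open to crawling.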

It’s easy to see how this misunderstanding could happen. For example, if you do the search [Nissan Motors] you get back a pretty useful snippet: “Manufactures automobiles including passenger cars, buses, trucks and related parts and accessories. (Nasdaq: NSANY).” It almost looks like we’ve crawled the page–but we haven’t. Nissan also forbids all search engines from crawling their site with a robots.txt, so that snippet also comes from the Open Directory Project.

Several years ago, the Library of Congress had a robots.txt that didn’t allow any search engines (they do now), so www.loc.gov wouldn’t show up in SafeSearch. So we changed it so a whitelist can trump an uncrawled document in SafeSearch. We studied the .gov domain and didn’t find any pornographic content (the closest we found was the Kenneth Starr report).

P.S. Metallica isn’t in my regular playlist, but I did watch Some Kind of Monster recently. It’s a much more nuanced view of the band than the Napster Bad Flash parody.

39 Responses to Q: Why doesn’t my site show in SafeSearch?, or “Do you hate Metallica?” (Leave a comment)

  1. They’ve definitely designed one SE-unfriendly site there. Now I know what James Hetfield meant by “InterWeb Bad!!!”

    Still they have been (poorly) indexed by many search engines, robots.txt or no, though the fact that page titles and descriptions are largely lacking indicates that spiders aren’t actually crawling the site.

  2. This is a good post that touches on the issue, and can go much deeper. Often a misconfigured robots.txt is the beginning of confusion for webmasters. Once the bot is directed in an unexpected way and pages get indexed, they may remain in the index long after robots.txt is “corrected”. Typically this happens in the very beginning, before robots.txt has been considered. Ultimately the webmaster inspects the Google index, sees pages that “shouldn’t be there” ARE there, and “newer, more important pages” are NOT indexed, and the confusion (and frustration) mounts.

    Thankfully Google does a pretty good job of providing resources for removing pages, adjusting robot directives, and informing webmasters. You’re helping further by posting this topic here.

  3. Just to build on this: if we use robots.txt to exclude dynamic pages such as http://abcnews.go.com/Entertainment/wireStory?id=1036029, do we take out only the page named, or do we run the risk of taking out every page whose URL starts with that string, such as http://abcnews.go.com/Entertainment/wireStory?id=10360291
    http://abcnews.go.com/Entertainment/wireStory?id=10360292
    etc etc….

  4. Can you provide an example of a Google friendly robots.txt file? I’m just worried I have the same robots.txt file Metallica does.

  5. Here is the start of my robots.txt file:

    User-agent: *
    Disallow: /

    User-agent: Googlebot
    Disallow:

    It seems to work fine for me, I get indexed quite frequently by Google. If you want to allow images, duplicate the second entry, but change the user-agent to ‘Googlebot-Image’. I also allow the Internet Archive (user-agent: ia_archiver).

  6. Matt,

    And here I thought most algo-engineers thought Metallica was a bat-head-eating clan of devil-worshippers. Here I see you have included reference to [“metallica”] in almost your first post.

    Kudos to you, and I’m sure the soft-spoken Lars would agree. I’m sure you remember Lars; He was the voice of the corporate music industry against P2P’s. Although not my first choice, being a drummer myself I have to reward his input.

    Of course I still have LimeWire, Shareaza and eDonkey, (too much spyware and viruses on Grokster), but I also still buy cd’s mostly based on 30 sec. snippets. So although misinterpreted, he was only looking after his wallet…so Kudos to you Matt for preserving the net, and Kudos to Lars for getting so much publicity for P2P’s. 😉

  7. ps. napster who?

  8. okay just one more thing: Check your robots.txt file formatting here:
    [“http://www.sxw.org.uk/(remove this)computing/robots/check.html”]

    chow!

  9. I really hope I have nothing similar to what Metallica has… can anyone explain to me how this works?

  10. Hello Matt,

    When I submit my robots.txt through the automatic removal console, I get the following message:

    “URLs cannot have wild cards in them (e.g. “*”). The following line contains a wild card: DISALLOW/*PHPSESSID”

    But on Google guidelines we can read this:

    “Additionally, Google has introduced increased flexibility to the robots.txt file standard through the use of asterisks. Disallow patterns may include “*” to match any sequence of characters”

    BTW, my robots txt is:

    User-agent: Googlebot
    Disallow: /*PHPSESSID

    –pb–

  11. I am always scared to make robots.txt even though I have it… http://www.asxvideos.com/robots.txt is everything ok there? I have disallowed a few parts. Also, this Disallow: /*PHPSESSID, what does it do? Does it truncate the long ugly urls?

  12. Why does it take so long to index a new site in Google? My site http://www.service-local.com was launched six weeks ago, but only the homepage showed up for the first five weeks.

  13. Anh

    You need to build more back links to get indexed in google otherwise you will never get picked up by them..

    View some of Matt’s other posts, you will find them very helpful

  14. Anh

    Yes, you need to work on getting backlinks to your site. It can take several months for a new site to be listed on Google. You may want to read about the Google sandbox.

  15. Some bots will grab your pages without reading your robots.txt. If you don’t want those bots to grab a page such as http://www.domain.com/dontgrab/1234.html, you can use JavaScript to show the link.

  16. Hey Matt,

    What’s the best way to set up a robots.txt file to prevent duplicate content with WordPress? If you do find time to answer this question, I’ll greatly appreciate it.

    Paul

  17. Just place your links on high PR websites and you should be fine after.

  18. I think it is important whether the high PR website’s content is related to your site.

  19. Sandbox experience sharing: is there a place for this here? Everything I have read about the sandbox so far is not exactly what is happening to my site, so maybe by learning from others’ experience I will be able to draw a better conclusion. If you do not have one, I suggest creating a place where people can share these types of experiences. I think it will be useful.

  20. Chung, just placing a link on a high PR website might work, but not the way you want it to work. I have been commenting on different blogs with my name and website, as I have done here. When you click on Zafar, it takes you to my site. If you search on MSN for Zafar, the first result is http://www.vbuniverse.com, but my site does not mention the word Zafar.

    My point is, when you place your link, make sure that wording of the link is one of the keywords related to your site. For example, instead of saying my name Zafar, I should be using [Latest Smartphones] or something which is related to my site.

  21. Yes relevancy is what is the most important thing not the PR I think.

    Akash

  22. Who uses safe search anyways? But this explains why some of my sites aren’t listed :D.

  23. you like what ??

  24. Robots files need to be correct… use google sitemaps, helps out a ton

  25. Google sitemap does help a lot on my web.

  26. Yea it helps me index really fast.

  27. Although robots.txt is a small file, it plays a vital role in getting indexed in search engines. I was not aware of these things until I came to this page.

  28. How do I modify robots.txt in Yahoo store?

  29. Just curious, how many people actually use safe search…..

  30. It is an interesting history, how robots.txt came about and how far-reaching it is.

  31. Haven’t even heard of safe search before. Nonetheless, I’ve learnt something new.

  32. Have you heard about Shenzhen, the Chinese Silicon Valley?
    Security cameras

  33. What the heck? I used to be a huge fan of Metallica but they are just getting weirder and weirder. This seems very counterproductive to me. Oh well.

  34. metallica- bunch of stupid dingalings

  35. Not right, Metallica rocks 🙂
    Btw, nice post Matt, I like how you explained it…

  36. Matt,
    can you tell me which level would be best to set for web crawling? My site has more than 5 lakh pages.

  37. User-agent: *
    Disallow: /photogallery/
    Disallow: /documenten/
    Disallow: /media/
    Disallow: /Links_archief/
    Disallow: /pdf documenten/
    Disallow: /dswa_originaliteit.htm

    I do it in this way to stop all searchengines, any one else similar positive experiences?

  38. Lets see, hmmmm
    Hippie on the outside, corporate zombies on the inside, pinheads compared to the grateful dead, and about 200 million dollars poorer as a result of resisting a kid with a bootleg version of Visual Studio… hmmmm

    Sounds like the jing a ling ling of….

    Metallicough,

    Didn’t they know Steven Tyler invented his own bootleg black market release after every show?

    Metallica sells out everywhere, so to speak LOL

    So guess the deal with TechN9ne is off

  39. I like the Metallica reference.
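
Several comments above ask how Disallow patterns match URLs: whether blocking one wireStory URL also blocks longer URLs that start the same way, and what /*PHPSESSID does. The following is a rough Python sketch of the matching rules as they are commonly documented: a plain rule is a prefix match, “*” matches any sequence of characters, and a trailing “$” (an extension the post does not mention, assumed here) anchors the end of the URL. The function name rule_matches is made up for illustration, and this is not Googlebot’s actual code.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a Disallow pattern applies to a URL path (with query)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each "*" match any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start only, so an unanchored rule is a prefix match.
    return re.match(regex, path) is not None


story = "/Entertainment/wireStory?id=1036029"

# A plain rule is a prefix match, so longer IDs are caught as well.
print(rule_matches(story, story + "1"))

# With the assumed "$" anchor, only the exact URL is caught.
print(rule_matches(story + "$", story + "1"))

# "/*PHPSESSID" applies to any URL that contains PHPSESSID.
print(rule_matches("/*PHPSESSID", "/index.php?PHPSESSID=abc123"))
```

Under these assumptions, a rule such as /*PHPSESSID does not shorten URLs. It only tells the crawler to skip every URL that contains a session ID.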
