Don’t end your urls with .exe

Sometimes at a conference people will ask me “Does it matter what extension I use for my pages? Does Google prefer .php over .asp, or .html over .htm?” And my answer is “We’re happy to crawl all of these file extensions. It doesn’t matter what you choose between any of those.”

Usually I also try to insert a reminder at the end of my reply such as “But there are some file extensions that are mostly binary data, such as .exe, where the vast majority of the time the data would be meaningless blobs, so there are a few extensions to avoid. If your files are named example.dll or example.bin and you don’t see Google crawling pages with that file extension, I’d recommend changing your file extension to something else.”

There’s a simple way to check whether Google will crawl things with a certain filetype extension. If you do a query such as [filetype:exe] and you don’t see any urls that end directly in “.exe” then that means either 1) there are no such files on the web, which we know isn’t true for .exe, or 2) Google chooses not to crawl such pages at this time -- usually because pages with that file extension have been unusually useless in the past. So for example, if you query for [filetype:tgz] or [filetype:tar], you’ll see urls such as “papers.ssrn.com/pape.tar?abstract_id” that contain “.tar” but no files that end directly in .tar. That means that you probably shouldn’t make your html pages end in .tar.
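As a rough illustration of that check, here is a minimal Python sketch. It builds a [filetype:] query url in the same google.com/search?q=... form that readers post in the comments, and it separates urls that end directly in an extension from urls that merely contain it. The function names and the example.com url are hypothetical, and the plain suffix test is only one reading of "end directly in"; it is not Google's actual crawl logic.

```python
from urllib.parse import urlencode


def filetype_query_url(extension: str) -> str:
    """Build a search url for the query [filetype:<extension>]."""
    ext = extension.lstrip(".")
    return "http://www.google.com/search?" + urlencode({"q": f"filetype:{ext}"})


def ends_directly_in(url: str, extension: str) -> bool:
    """True only when the url string itself ends in .<extension>.

    Assumption for illustration: a url followed by a query string
    (for example "pape.tar?abstract_id") does not count.
    """
    ext = extension.lstrip(".").lower()
    return url.lower().endswith("." + ext)


if __name__ == "__main__":
    print(filetype_query_url("tar"))
    # The first url is the one quoted in the post; the second is made up.
    for url in ["papers.ssrn.com/pape.tar?abstract_id", "example.com/archive.tar"]:
        print(url, ends_directly_in(url, "tar"))
```

If every result for a [filetype:] query falls in the "merely contains" group, that suggests urls ending directly in that extension are not being crawled.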

The SEOmoz folks stumbled across this when they had a url that ended with “/web2.0”. It looks like they previously had a url that looked like “/web2.0/” (note the trailing slash), which we were happy to crawl/index/rank. But when their linkage shifted enough that “/web2.0” became their preferred url, Google wouldn’t crawl urls ending in “.0”, so the page went uncrawled.
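To see why the trailing slash made a difference, it can help to look at how the last path segment of each url splits into a name and an extension. The sketch below uses Python's standard posixpath.splitext for that. Treating the text after the final dot as the "extension" is an assumption made for illustration, not a description of how Google's crawler parses urls, and the helper name and example.com host are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def url_extension(url: str) -> str:
    """Return the extension of the last path segment, or "" if there is none."""
    path = urlsplit(url).path
    last_segment = path.rsplit("/", 1)[-1]
    return posixpath.splitext(last_segment)[1]


if __name__ == "__main__":
    print(repr(url_extension("http://www.example.com/web2.0")))   # '.0'
    print(repr(url_extension("http://www.example.com/web2.0/")))  # ''
```

Without the slash, the url looks like a file with a ".0" extension. With the slash, the last segment is empty and there is no extension to judge.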

Even though urls ending in “.0” are often binary and therefore end up getting dropped later in our indexing pipeline, it’s always good to revisit old decisions and respond to feedback by running new tests. So just in the last day or so, we switched it so that Google is willing to crawl pages that end in “.0”. This will help the small number of pages out on the web that want to serve up HTML pages with a “.0” extension.

You can see the results trickling into Google with a bunch of “X hours ago” fresh results:

[Image: 0 file extension]

So my quick takeaways would be:
- Why Google doesn’t crawl some filetype extensions (when we’ve seen good evidence that the extensions are mostly binary or otherwise not-very-indexable files).
- An easy way to use the filetype: operator, so that you can decide whether to avoid a particular filename extension yourself.
- Google is willing to revisit old decisions and test them again, which is what we’re doing with the “.0” filetype extension.

I hope that helps a few people who are considering unusual filetype extensions of their own. :)


32 Comments »

  1. chuckallied Said,

    June 13, 2008 @ 9:59 am

    Matt, I don’t know about the feasibility of this, but I wouldn’t mind seeing a “files” version for Google (like images, video, etc) that indexes all these things.

  2. Michael D Said,

    June 13, 2008 @ 10:03 am

    It seems like WordPress converts titles to friendly urls, but I’ve avoided using .,!?# and others for fear of urls not getting indexed properly. Maybe overly paranoid, but that’s just me.

  3. Scott Said,

    June 13, 2008 @ 10:42 am

    “Google is willing to revisit old decisions”
    Matt,
    Anything new from my old question?
    thanks,
    Scott

  4. Bernd Said,

    June 13, 2008 @ 10:57 am

    One of the prominent users of strange extensions is Ars Technica with .ars. I always thought that it would be logical to use a “normal” extension like .html or .php to avoid any presentational or crawling related problems.
    But there should be some differences between .html and .php, right? .php is not crawled as often as .html because .php means “dynamic” and stressful for the webserver. With mod_rewrite, .html can also be .php. So why does Google look at the extension? The MIME type should be more useful.

  5. Jarlskov Said,

    June 13, 2008 @ 10:57 am

    chuckallied >> Can’t you just use the search term Matt used in the example? So if you want to find files named foo.bar you search for:
    foo filetype:bar
    I guess that should solve your problem :)

  6. Matt Cutts Said,

    June 13, 2008 @ 11:00 am

    “But there should be some differences between .html and .php, right?”

    Bernd, I’d contend that there shouldn’t. Whether a page is dynamic or not isn’t really indicated by the extension. Plenty of “.html” pages are actually dynamic or generated on-the-fly by the server, for example.

  7. g1smd Said,

    June 13, 2008 @ 11:23 am

    Nice find by the SEOmoz folks.

    I wonder if there were any other people previously wondering why their Web 2.0 pages or their stuff about radio stations, or different software versions, wasn’t ranking. Now we know why.

    It will be interesting to see how many new pages are ultimately found. I’m guessing it will be at least a few thousand.

    I have used .0 in a URL on several occasions, but it has always been either as a folder name with trailing slash, or else as a filename with a regular .html (or .php, or whatever) extension on the end.

    I’ve always been a bit wary of extensionless URLs.

  8. Russ Jones Said,

    June 13, 2008 @ 11:47 am

    And for the love of God, don’t use a custom file extension of .seo

    http://www.google.com/search?q=ext:seo

  9. g1smd Said,

    June 13, 2008 @ 11:59 am

    Oooook. filetype:0

    Heading for half a million results already....

  10. g1smd Said,

    June 13, 2008 @ 12:00 pm

    But what about using the .sem extension then?

    http://www.google.com/search?num=100&q=ext%3Asem

  11. Scott Kavanagh Said,

    June 13, 2008 @ 12:26 pm

    That explains an issue I had with a client’s site a while ago.
    Thank you for this, now I know what to do, I had virtually given up on ever finding the problem.

  12. john andrews Said,

    June 13, 2008 @ 2:52 pm

    Reminds me of the old days of SEO, but in reverse. So what is the file extension that causes a #1 rank?

  13. Philipp Lenssen Said,

    June 13, 2008 @ 4:04 pm

    > There’s a simple way to check whether Google will
    > crawl things with a certain filetype extension.

    Google Code Search crawls zip files even if they don’t show up using that test...

  14. Doug Heil Said,

    June 13, 2008 @ 6:12 pm

    The interesting thing to me about all of this is the site would not have had any problem at all if they would have stuck to their original url with the trailing slash. It’s SEO 101 to use a trailing slash for a page that is the index page of a folder.

    Good stuff though for Google to revisit the url ending in zero.

  15. Matt Cutts Said,

    June 13, 2008 @ 6:21 pm

    john andrews, surely that would be the .googlepray extension, to mirror the googlepray meta tag? ;)

    Philipp Lenssen, you’re correct, but I was talking about the main web search. Our code search is willing to delve into things like zips and tarballs to find code.

  16. Multi-Worded Adam Said,

    June 13, 2008 @ 9:41 pm

    I’m somewhat curious, Matt (although I suspect I know some of the answer).

    http://www.google.com/search?hl=en&rls=GGLL%2CGGLL%3A2008-07%2CGGLL%3Aen&q=Texas+Land+40+acres

    If .dll is one of those extensions big G has trouble with, how is it that eBay has a result with a .DLL extension in there? Is it because of the MIME type, as Bernd suggested?

  17. purposeinc Said,

    June 14, 2008 @ 2:32 am

    Matt,
    I love it when you give examples that are that specific.
    As you know I am a strong promoter of Open Source Google!
    I never would have considered this one way or the other, and then could have ended up with indexed pages. Thanks for the exactitude!

    I believe it was g1smd who suggested years ago at Webmaster World to end our URL’s with a /

    I have been trying to do that ever since.

    Now it makes more sense why.

    dk

  18. Harith Said,

    June 14, 2008 @ 9:12 am

    Matt,

    This is great documentation. Should we expect to see it added to GOOG quality guidelines (maybe under: Design and content guidelines)?

  19. starman_uk Said,

    June 14, 2008 @ 3:34 pm

    So would this have had an effect on SMF forums with urls ending in
    /index.php?board=6.0
    /index.php?topic=216.0

  20. Kristofer Said,

    June 15, 2008 @ 3:02 pm

    Hey,
    I and a few others just realized that “Google reserves the right to Terminate your account at any time, for any reason, with or without notice.”. For example, if you get in trouble for some reason with Adsense, which can happen even though you don’t click yourself, you might get your GMail account suspended?

    The thought of losing my Mail really scares me and I would really like to hear if that can be the case, in the example above. I would think twice before having my company e-mail outsourced to Google if it works like that.

  21. Matt Cutts Said,

    June 15, 2008 @ 6:44 pm

    M.W.A., that’s because the url is “cgi.ebay.ca/ws/eBayISAPI.dll?ViewItem&item=270224738405”. If the url were “cgi.ebay.ca/ws/eBayISAPI.dll” then we wouldn’t have crawled it, but since there was something else in the url that followed the .dll, Google was willing to crawl the url. Note that I said Google won’t crawl the url if it ends directly in .exe or .dll or whatever. So http://www.example.com/page.exe we wouldn’t crawl. But http://www.example.com/page.exe?whatever we probably would be willing to crawl.

    Harith, my hunch is that this is a little detailed for our webmaster guidelines, but maybe we could turn this post into some sort of official documentation somewhere.

    Kristofer, I’ve never heard of that happening. My hunch is that Google has stuff like that in our TOS so that if email/CAPTCHA spammers are creating bunches of accounts, we can disable those. So I wouldn’t worry about this, in my opinion.

  22. Peter (IMC) Said,

    June 15, 2008 @ 7:13 pm

    Matt,

    In that example you’re giving, FM_92.0 is now indexed. But what about 92.1
    92.2
    92.3
    etc.

    and for that matter you may also expect:

    92.10
    92.20
    92.30
    etc.

  23. Sint Said,

    June 16, 2008 @ 3:05 am

    Hi Matt,

    Thanks for clearing some things up again.

    I assume Google not only keeps these specific kinds of files (executables, binaries) from the SERP’s because it is possible they aren’t ordinary web pages, but also because downloading these files could be a security risk to users, when an executable contains hazardous code.

    Does Google only look at the last characters of a URL, or do you also look at content types etc.? Otherwise one could still get a file into Google by linking to it like http://www.domain.com/somevirus.exe?hello=world (adding a random parameter).

    It seems logical to block some extensions of files that are, according to the web standards, not part of a web page, but ignoring the extension if a GET parameter is provided sounds a bit inconsistent to me. Especially when there are better ways to determine what type a file is.

    Maybe you could tell a bit more about why Google is handling the extensions this way?

  24. Ankit Said,

    June 16, 2008 @ 8:03 am

    matt seems like mozzers are back with their web2.0 page in Google index !!
    http://ankitrawat.com/blog/seomozz-got-its-web20-page-back-in-google/
    what u have to say abt this ?

  25. Mike Irving Said,

    June 16, 2008 @ 9:21 am

    Hi Matt,

    Interesting Article.

    You briefly mention use of the trailing slash in filenames, i.e. http://www.somesite.com/some-content/

    I have used various URL Re-Write methods in the past, some using the trailing slash and some not.

    Also, sites with no URL Re-Write in place but with, for example, default.aspx or default.php placed in various folders are quite common so as to achieve “re-write effect” URLs without any actual rewriting.

    I am just wondering whether “trailing slash” filenames are preferable over non slash URLs, or vice-versa.

    I suppose “trailing slash” URLs will be one character longer, accounting for the slash, but would that have any advantage/disadvantage when it comes to search ranking?

    Would a “trailing slash” page be given less relevance as it could be interpreted as looking as if it belongs to a section below the parent document?

    Interesting things URLs, they vary so much from site to site.

    Mike.

  26. BradleyT Said,

    June 16, 2008 @ 10:16 am

    What about no extension and no trailing slash? This is the way Drupal creates pages when you turn on friendly URLs.

    I guess since a site:drupal.org query returns 1.4 million results with many of them having no trailing slash and no extension it’s fair to assume Google is OK with this?

  27. Matt Cutts Said,

    June 16, 2008 @ 12:40 pm

    Ankit, I think that’s a result of Google responding to feedback and trying to crawl urls that end in “.0″; we’ll see how many binary packages come in as well before making a final decision.

  28. chuckallied Said,

    June 16, 2008 @ 3:10 pm

    >>Jarlskov>> I could, and do, but in the same way Google image search displays image results differently, it’d be nice to see a Google File search that likewise includes sorting and display features specifically designed for files.

  29. g1smd Said,

    June 17, 2008 @ 4:25 pm

    *** use of the trailing slash in filenames ***

    No. The trailing slash is NOT used for a filename.

    Trailing slash is the correct canonical form for a FOLDER.

    Redirect to that form, from the one without it.
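The redirect suggested here can be sketched in a few lines of Python. This is a minimal illustration with a hypothetical function name, and it assumes the caller already knows the url refers to a folder; it only computes the redirect target, leaving the 301 response itself to whatever server or framework is in use.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def folder_redirect_target(url: str) -> Optional[str]:
    """Return the trailing-slash form of a folder url, or None if it already has one.

    Assumption: the url is known to be a folder, not a file.
    Any query string is kept after the added slash.
    """
    parts = urlsplit(url)
    if parts.path.endswith("/"):
        return None
    return urlunsplit(parts._replace(path=parts.path + "/"))


if __name__ == "__main__":
    print(folder_redirect_target("http://www.somesite.com/some-content"))
    print(folder_redirect_target("http://www.somesite.com/some-content/"))
```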

  30. Tim Rimmer Said,

    June 17, 2008 @ 8:57 pm

    on the topic of statics versus dynamic pages.

    I am surprised to learn that Google does not use the page extension as an indicator of quality. Even if a small one.

    Obviously anyone can hide a dynamic page through mod rewrite etc but still I would have thought that overall on the web there is more webspam in php, asp, aspx, cgi etc pages.

    Google, having the world’s best data set to test these things, could do a correlation between extensions and quality, and I am surprised that there is not a significant lean towards .html or no extension having better quality pages.

    Is there really no difference in the quality of pages with php vs htm, or is it more so that the collateral damage to good quality dynamic sites is too high?

  31. Multi-Worded Adam Said,

    June 18, 2008 @ 8:09 am

    M.W.A., that’s because the url is “cgi.ebay.ca/ws/eBayISAPI.dll?ViewItem&item=270224738405”. If the url were “cgi.ebay.ca/ws/eBayISAPI.dll” then we wouldn’t have crawled it, but since there was something else in the url that followed the .dll, Google was willing to crawl the url. Note that I said Google won’t crawl the url if ends directly in .exe or .dll or whatever. So http://www.example.com/page.exe we wouldn’t crawl. But http://www.example.com/page.exe?whatever we probably would be willing to crawl.

    This explanation makes some sense. I have always used the “definition” of a URL as provided by ASP’s ServerVariables collection, which doesn’t include the querystring (which is a separate animal).

    But this leads to another concern, not so much for myself personally but for the average user who may not be so tech-savvy (i.e. the one Sint asked). By this explanation, theoretically http://www.some-domain.com/some-virus.exe wouldn’t be indexed, but http://www.some-domain.com/some-virus.exe?this=a_random_querystring would. I know a lot of people who don’t bother to look at the extension of a link before they click on it (in fact, I would say the vast majority wouldn’t) and may well stumble upon something they’re not supposed to.

    It’s also theoretically possible, through cloaking, to present “valid” HTML content to a bot and something completely different to what is perceived to be a human.

    I think Sint has a good question...obviously the querystring as a standalone cannot be considered. So what else is?

  32. Kris Said,

    June 19, 2008 @ 10:11 am

    Hello Matt,

    I have still a question about urls. I know now that .html or .php are the same, but what about pages without an extension?

    example:
    http://www.domain.com/product1.html
    http://www.domain.com/product1

    Which do you suggest we use for our websites?

    Thanks!
