Don’t end your urls with .exe

Sometimes at a conference people will ask me “Does it matter what extension I use for my pages? Does Google prefer .php over .asp, or .html over .htm?” And my answer is “We’re happy to crawl all of these file extensions. It doesn’t matter what you choose between any of those.”

Usually I also try to insert a reminder at the end of my reply such as “But there are some file extensions that are mostly binary data, such as .exe, where the vast majority of the time the data would be meaningless blobs, so there are a few extensions to avoid. If your files are named example.dll or example.bin and you don’t see Google crawling pages with that file extension, I’d recommend changing your file extension to something else.”

There’s a simple way to check whether Google will crawl things with a certain filetype extension. If you do a query such as [filetype:exe] and you don’t see any urls that end directly in “.exe” then that means either 1) there are no such files on the web, which we know isn’t true for .exe, or 2) Google chooses not to crawl such pages at this time — usually because pages with that file extension have been unusually useless in the past. So for example, if you query for [filetype:tgz] or [filetype:tar], you’ll see urls such as “papers.ssrn.com/pape.tar?abstract_id” that contain “.tar” but no files that end directly in .tar. That means that you probably shouldn’t make your html pages end in .tar.
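If you want to run this check for several extensions at once, the queries are easy to generate. Here’s a minimal Python sketch; the helper name `filetype_query_url` is made up for illustration, and it only builds the search url for you to open in a browser (it doesn’t fetch or parse any results).

```python
from urllib.parse import urlencode


def filetype_query_url(extension: str) -> str:
    """Build a Google search url for the query [filetype:<extension>]."""
    # Accept "exe", ".exe" or ".EXE" and normalize to a bare lowercase extension.
    ext = extension.strip().lstrip(".").lower()
    if not ext:
        raise ValueError("extension must not be empty")
    # urlencode percent-encodes the colon in "filetype:exe".
    return "https://www.google.com/search?" + urlencode({"q": f"filetype:{ext}"})


for ext in ("exe", "tar", "tgz"):
    print(filetype_query_url(ext))
```

Open each url and look at whether any of the results end directly in that extension.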

The SEOmoz folks stumbled across this when they had a url that ended with “/web2.0”. It looks like previously they had a url that looked like “/web2.0/” (note the trailing slash), which we were happy to crawl/index/rank. But when their linkage shifted enough that “/web2.0” became their preferred url, the page became uncrawled, because Google wouldn’t crawl urls ending in “.0”.
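To make “ends directly in” concrete, here’s a rough Python sketch. It is only an illustration of the distinction, not Google’s actual crawler logic; the function name `ends_directly_in` and the www.example.com host are made up.

```python
def ends_directly_in(url: str, extension: str) -> bool:
    """Return True if the url string itself ends with ".<extension>"."""
    ext = "." + extension.lstrip(".").lower()
    # Compare against the whole url, so a trailing slash or a query string
    # after the extension means the url does not end directly in it.
    return url.lower().endswith(ext)


print(ends_directly_in("http://www.example.com/web2.0", "0"))           # True
print(ends_directly_in("http://www.example.com/web2.0/", "0"))          # False
print(ends_directly_in("papers.ssrn.com/pape.tar?abstract_id", "tar"))  # False
```

The trailing slash and the “?abstract_id” query string are both enough to keep a url from ending directly in the extension.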

Even though urls ending in “.0” are often binary and therefore end up getting dropped later in our indexing pipeline, it’s always good to revisit old decisions and respond to feedback by running new tests. So just in the last day or so, we switched it so that Google is willing to crawl pages that end in “.0”. This will help the small number of pages out on the web that want to serve up HTML pages with a “.0” extension.

You can see the results trickling into Google with a bunch of “X hours ago” fresh results:

[Image: 0 file extension]

So my quick takeaways would be:
– Why Google doesn’t crawl some filetype extensions (when we’ve seen good evidence that the extensions are mostly binary or otherwise not-very-indexable files).
– An easy way to use the filetype: operator, so that you can decide whether to avoid a particular filename extension yourself.
– Google is willing to revisit old decisions and test them again, which is what we’re doing with the “.0” filetype extension.

I hope that helps a few people who are considering unusual filetype extensions of their own. 🙂

39 Responses to Don’t end your urls with .exe

  1. Matt, I don’t know about the feasibility of this, but I wouldn’t mind seeing a “files” version for Google (like images, video, etc) that indexes all these things.

  2. It seems like wordpress converts titles to friendly urls but I’ve avoided using .,!?# and others for thought of urls not getting indexed properly. Maybe over paranoid but that’s just me.

  3. “Google is willing to revisit old decisions”
    Matt,
    Anything new from my old question?
    thanks,
    Scott

  4. One of the prominent users of strange extensions is Ars Technica with .ars. I always thought that it would be logical to use a “normal” extension like .html or .php to avoid any presentational or crawling related problems.
    But there should be some differences between .html and .php, right? .php is not crawled as often as .html because .php means “dynamic” and stressful for the webserver. But with mod_rewrite, .html can also be .php. So why does Google look at the extension? The MIME-type should be more useful.

  5. chuckallied >> Can’t you just use the search term Matt used in the example? So if you want to find files named foo.bar you search for:
    foo filetype:bar
    I guess that should solve your problem 🙂

  6. “But there should be some differences between .html and .php, right?”

    Bernd, I’d contend that there shouldn’t. Whether a page is dynamic or not isn’t really indicated by the extension. Plenty of “.html” pages are actually dynamic or generated on-the-fly by the server, for example.

  7. Nice find by the SEOmoz folks.

    I wonder if there were any other people previously wondering why their Web 2.0 pages or their stuff about radio stations, or different software versions, wasn’t ranking. Now we know why.

    It will be interesting to see how many new pages are ultimately found. I’m guessing it will be at least a few thousand.

    I have used .0 in a URL on several occasions, but it has always been either as a folder name with trailing slash, or else as a filename with a regular .html (or .php, or whatever) extension on the end.

    I’ve always been a bit wary of extensionless URLs.

  8. And for the love of God, don’t use a custom file extension of .seo

    http://www.google.com/search?q=ext:seo

  9. Oooook. filetype:0

    Heading for half a million results already….

  10. But what about using the .sem extension then?

    http://www.google.com/search?num=100&q=ext%3Asem

  11. That explains an issue I had with a clients site a while ago.
    Thank you for this, now I know what to do, I had virtually given up on ever finding the problem.

  12. Reminds me of the old days of SEO, but in reverse. So what is the file extension that causes a #1 rank?

  13. > There’s a simple way to check whether Google will
    > crawl things with a certain filetype extension.

    Google Code Search crawls zip files even if they don’t show up using that test…

  14. The interesting thing to me about all of this is the site would not have had any problem at all if they would have stuck to their original url with the trailing slash. It’s SEO 101 to use a trailing slash for a page that is the index page of a folder.

    Good stuff though for Google to revisit the url ending in zero.

  15. john andrews, surely that would be the .googlepray extension, to mirror the googlepray meta tag? 😉

    Philipp Lenssen, you’re correct, but I was talking about the main web search. Our code search is willing to delve into things like zips and tarballs to find code.

  16. I’m somewhat curious, Matt (although I suspect I know some of the answer).

    http://www.google.com/search?hl=en&rls=GGLL%2CGGLL%3A2008-07%2CGGLL%3Aen&q=Texas+Land+40+acres

    If .dll is one of those extensions big G has trouble with, how is it that eBay has a result with a .DLL extension in there? Is it because of the MIME type, as Bernd suggested?

  17. Matt,
    I love it when you give examples that are that specific.
    As you know I am a strong promoter of Open Source Google!
    I never would have considered this one way or the other, and then could have ended up with indexed pages. Thanks for the exactitude!

    I believe it was g1smd who suggested years ago at Webmaster World to end our URL’s with a /

    I have been trying to do that ever since.

    Now it makes more sense why.

    dk

  18. Matt,

    This is great documentation. Should we expect to see it added to GOOG quality guidelines (maybe under: Design and content guidelines)?

  19. So would this have had an effect on SMF forums with urls ending in
    /index.php?board=6.0
    /index.php?topic=216.0

  20. Hey,
    I and a few others just realized that “Google reserves the right to Terminate your account at any time, for any reason, with or without notice.”. For example, if you get in trouble for some reason with Adsense, which can happen even though you don’t click yourself, you might get your GMail account suspended?

    The thought of losing my Mail really scares me and I would really like to hear if that can be the case, in the example above. I would think twice before having my company e-mail outsourced to Google if it works like that.

  21. M.W.A., that’s because the url is “cgi.ebay.ca/ws/eBayISAPI.dll?ViewItem&item=270224738405”. If the url were “cgi.ebay.ca/ws/eBayISAPI.dll” then we wouldn’t have crawled it, but since there was something else in the url that followed the .dll, Google was willing to crawl the url. Note that I said Google won’t crawl the url if it ends directly in .exe or .dll or whatever. So http://www.example.com/page.exe we wouldn’t crawl. But http://www.example.com/page.exe?whatever we probably would be willing to crawl.

    Harith, my hunch is that this is a little detailed for our webmaster guidelines, but maybe we could turn this post into some sort of official documentation somewhere.

    Kristofer, I’ve never heard of that happening. My hunch is that Google has stuff like that in our TOS so that if email/CAPTCHA spammers are creating bunches of accounts, we can disable those. So I wouldn’t worry about this, in my opinion.

  22. Matt,

    In that example you’re giving, FM_92.0 is now indexed. But what about 92.1
    92.2
    92.3
    etc.

    and for that matter you may also expect:

    92.10
    92.20
    92.30
    etc.

  23. Hi Matt,

    Thanks for clearing some things up again.

    I assume Google not only keeps these specific kinds of files (executables, binaries) from the SERP’s because it is possible they aren’t ordinary web pages, but also because downloading these files could be a security risk to users, when an executable contains hazardous code.

    Does Google only look at the last characters of an URL, or do you also look at content types etc? Because else one could still get a file into Google by linking to it like http://www.domain.com/somevirus.exe?hello=world (adding a random parameter).

    It seems logical to block some extensions of files that are, according to the web standards, no part of a web page, but ignoring the extension if a GET parameter is provided sounds a bit inconsistent to me. Especially when there are better ways to determine what type a file is.

    Maybe you could tell a bit more about why Google is handling the extensions this way?

  24. Matt, seems like the mozzers are back with their web2.0 page in the Google index!!
    http://ankitrawat.com/blog/seomozz-got-its-web20-page-back-in-google/
    What do you have to say about this?

  25. Hi Matt,

    Interesting Article.

    You briefly mention use of the trailing slash in filenames, i.e. http://www.somesite.com/some-content/

    I have used various URL Re-Write methods in the past, some using the trailing slash and some not.

    Also, sites with no URL Re-Write in place but with, for example, default.aspx or default.php placed in various folders are quite common so as to achieve “re-write effect” URLs without any actual rewriting.

    I am just wondering whether “trailing slash” filenames are preferable over non slash URLs, or vice-versa.

    I suppose “trailing slash” URLs will be one character longer, accounting for the slash, but would that have any advantage/disadvantage when it comes to search ranking?

    Would a “trailing slash” page be given less relevance as it could be interpreted as looking as if it belongs to a section below the parent document?

    Interesting things URLs, they vary so much from site to site.

    Mike.

  26. What about no extension and no trailing slash? This is the way Drupal creates pages when you turn on friendly URLs.

    I guess since a site:drupal.org query returns 1.4 million results with many of them having no trailing slash and no extension it’s fair to assume Google is OK with this?

  27. Ankit, I think that’s a result of Google responding to feedback and trying to crawl urls that end in “.0”; we’ll see how many binary packages come in as well before making a final decision.

  28. >>Jarlskov>> I could, and do, but in the same way Google image search displays image results differently, it’d be nice to see a Google File search that likewise includes sorting and display features specifically designed for files.

  29. *** use of the trailing slash in filenames ***

    No. The trailing slash is NOT used for a filename.

    Trailing slash is the correct canonical form for a FOLDER.

    Redirect to that form, from the one without it.

  30. on the topic of statics versus dynamic pages.

    I am surprised to learn that Google does not use the page extension as an indicator of quality. Even if a small one.

    Obviously anyone can hide a dynamic page through mod rewrite etc but still I would have thought that overall on the web there is more webspam in php, asp, aspx, cgi etc pages.

    Google, having the world’s best data set to test these things, could do a correlation between extensions and quality, and I am surprised that there is not a significant lean towards .html or no extension having better quality pages.

    Is there really no difference in the quality of pages with php vs htm, or is it more that the collateral damage to good quality dynamic sites is too high?

  31. “M.W.A., that’s because the url is “cgi.ebay.ca/ws/eBayISAPI.dll?ViewItem&item=270224738405”. If the url were “cgi.ebay.ca/ws/eBayISAPI.dll” then we wouldn’t have crawled it, but since there was something else in the url that followed the .dll, Google was willing to crawl the url. Note that I said Google won’t crawl the url if it ends directly in .exe or .dll or whatever. So http://www.example.com/page.exe we wouldn’t crawl. But http://www.example.com/page.exe?whatever we probably would be willing to crawl.”

    This explanation makes some sense. I have always used the “definition” of a URL as provided by ASP’s ServerVariables collection, which doesn’t include the querystring (which is a separate animal).

    But this leads to another concern, not so much for myself personally but for the average user who may not be so tech-savvy (i.e. the one Sint asked). By this explanation, theoretically http://www.some-domain.com/some-virus.exe wouldn’t be indexed, but http://www.some-domain.com/some-virus.exe?this=a_random_querystring would. I know a lot of people who don’t bother to look at the extension of a link before they click on it (in fact, I would say the vast majority wouldn’t) and may well stumble upon something they’re not supposed to.

    It’s also theoretically possible, through cloaking, to present “valid” HTML content to a bot and something completely different to what is perceived to be a human.

    I think Sint has a good question…obviously the querystring as a standalone cannot be considered. So what else is?

  32. Hello Matt,

    I have still a question about urls. I know now that .html or .php are the same, but what about pages without an extension?

    example:
    http://www.domain.com/product1.html
    http://www.domain.com/product1

    Which do you suggest we use for our websites?

    Thanks!

  33. yeah, THAT would interest me too!

    example:
    http://www.domain.com/product1/index.html
    http://www.domain.com/product1/

    is it better to link to a file (index.html) or the folder?

  34. For me it really doesn’t matter if there is a .html extension or I only use a “folder” … I’ve got projects ranked well with both and if you have a look at the most blogs with speaking urls, you’ll see that they probably all use the “folder” thing. Matt’s blog too. 😉

    For me “folder/index.html” is the same like “folder/” cause the index file would be called automatically so I would drop the index.html part. But I’m really sure “domain.com/folder/” would be better than “domain.com/folder” (missed trailing slash).

    Maybe Matt could bring some light into but currently I’m pretty sure it doesn’t matter for Google for the “index.html”- or the “product1/”- vs. “product1.html”-thing.

  35. Hello Matt,

    Similar question to the two previous posters:

    category.example.com/item-title/
    vs
    http://www.example.com/category/item-title/

    I’m more interested in which url is preferred by Google visitors than which ranks better. The sub-domain category is shorter, but the www domain url could be considered more safe. Has Google done any testing of this?

  36. In the Google webmaster tools, a URL “folder/index.html” is handled like “folder/” – and even changed to that…

  37. I have to agree with René (the previous poster). I’ve had the same experience with Google. No matter what kind of endings the urls have, Google indexes them all and they have the same results in the SERPS. Maybe outside of Germany Google works in another way.

  38. Matt
    What about the possible impact of changing all file extensions on an old site with strong credibility from .htm to .html?
    Richard

  39. Did Google change the policy?

    I tried
    filetype:00
    filetype:000
    filetype:00000
    filetype:01200
    filetype:99200
    filetype:93520
    filetype:01100
    filetype:1325201

    And all of them have results that end exactly as stated in the filetype. It doesn’t seem Google had explicitly allowed these file extensions right?
