The web is a fuzz test: patch your browser and your web server

One of my favorite computer science papers is a 1990 paper titled “An Empirical Study of the Reliability of UNIX Utilities”. The authors discovered that if they piped random junk into UNIX command-line programs, a remarkable number of them crashed. Why? The random input triggered bugs, some of which had probably hidden for years. Up to a third of the programs that they tried crashed.

That paper helped popularize fuzz testing, which tests programs by giving random gibberish as input. Some people call this a monkey test, as in “Pound on the keyboard like a caffeine-crazed monkey for a few minutes and see if the program crashes.” 🙂
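
The idea needs barely a dozen lines of code. Here’s a minimal sketch (not from the paper; the ./some_program path is just a placeholder) that pipes random bytes into a command-line program and tallies crashes and hangs:

    # Minimal fuzz harness: feed random junk to a program's stdin and see what happens.
    import random
    import subprocess

    def fuzz_once(cmd, max_len=4096):
        junk = bytes(random.randrange(256) for _ in range(random.randrange(1, max_len)))
        try:
            proc = subprocess.run(cmd, input=junk, capture_output=True, timeout=5)
        except subprocess.TimeoutExpired:
            return "hang"
        # A negative return code means the process died on a signal
        # (e.g. -11 is SIGSEGV on Linux) -- the classic "it crashed" result.
        return "crash" if proc.returncode < 0 else "ok"

    if __name__ == "__main__":
        results = [fuzz_once(["./some_program"]) for _ in range(100)]
        print({outcome: results.count(outcome) for outcome in set(results)})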

I can tell you that the web is a fuzz test. If you write a program to process web pages, there are few better workouts than piping a huge number of web pages through it. 🙂 I’ve seen computer programs that ran with no problem across our entire web index except for *one* document. You would not believe the sort of weird, random, ill-formed stuff that some people put up on the web: everything from tables nested to infinity and beyond, to web documents with a filetype of exe, to executables returned as text documents. In a 1996 paper titled “An Investigation of Documents from the World Wide Web,” Eric Brewer of Inktomi and his colleagues discovered that over 40% of web pages had at least one syntax error:

weblint was used to assess the syntactic correctness of a subset of the HTML documents in our data set (approximately 92,000). …. Observe that over 40% of the documents in our study contain at least one error.

At a search engine, you have to write your code to process all that randomness and return the best documents. By the way, that’s why we don’t penalize sites if they have syntax errors or don’t validate — sometimes the best document isn’t well-formed or has a syntax error.
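
To make that concrete, here’s a toy sketch (not Google’s code) of the defensive posture you end up with: decode whatever bytes arrive without trusting the declared encoding, parse leniently, cap the input size, and make sure one broken document can never take down the whole pipeline.

    # Toy example of processing untrusted web pages without crashing on bad input.
    from html.parser import HTMLParser  # lenient by design; it won't raise on malformed markup

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def extract_links(raw_bytes, max_bytes=1_000_000):
        # Don't trust the server's Content-Type or charset; replace undecodable bytes.
        text = raw_bytes[:max_bytes].decode("utf-8", errors="replace")
        parser = LinkCollector()
        try:
            parser.feed(text)
            parser.close()
        except Exception:
            pass  # a hopeless page yields whatever was parsed so far, not a crash
        return parser.links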

But: the web is a fuzz test for you too, gentle reader. As you surf the web, your browser is subjected to an amazing amount of random stuff. Here’s a scary example: a couple months ago, someone was surfing a website and noticed that an ad was serving up malware. I know of a completely different web site that apparently got hit by the same incident.

So the take-aways from this post are:

  1. Fuzz testing is a great way to uncover bugs.
  2. Lots of great web pages have syntax errors or don’t validate, which is why we still return those pages in Google.
  3. If you’re an internet user, make sure you surf with a fully-patched operating system and browser. You can decrease your risk of infection by using products off the beaten path, such as MacOS, Linux, or Firefox.
  4. If you’re a website owner and Google has flagged your site as suspected of serving malware, sometimes it’s because your site served ads with embedded malware. Check if you’ve changed anything recently in how you serve ads. When you think your site is clean, read this post about malware reviews and this malware help topic for more info about getting your site reviewed quickly. Even if your site is in good shape, you might want to review this security checklist post by Nathan Johns.

Update Nov. 15 2007: Fellow Googler Ian Hickson contacted me with more recent numbers from a September 2006 survey that he did of several billion pages. Ian found that 78% of pages had at least one error if you ignore the two least critical error types, and 93% if you include those two. There isn’t a published report right now, but Ian has given those numbers out in public e-mail, so he said it was fine to mention the percentages.

These numbers pretty much put the nail in the coffin for the “Only return pages that are strictly correct” argument, because there wouldn’t be that many pages to work with. 🙂 That said, if you can design and write your HTML code so that it’s well-formed and validates, it’s always a good habit to do so.

By the way, if this is the sort of thing that floats your boat, you might want to check out Google’s Code blog, where Ian has posted before.

49 Responses to The web is a fuzz test: patch your browser and your web server

  1. Wow! Any other computer science paper classics you would like to share?

  2. Matt, I’ve always seen your comments section here as some form of fuzz test. The only thing is I don’t know if you’ve ever crashed or encountered a personal bug. 😉

  3. Great advice mat. Thank you. Having the whole index at your disposal must be a huge benefit to testing. I’m kind of envious. 😉

  4. Thanks for clarifying once again that Googlebot doesn’t penalize for not passing validation. I get harassed all the time that OneCall doesn’t pass validation in just about every webmaster, SEO, or code forum out there. I am going to add this post to Google Bookmarks and toss it back next time I get a Valid HTML code freak out there. 😉

  5. It is important that this column frequently addresses rumors that spread across SEO blogs and forums. There are SEO theories that well-formed documents get better rankings.

    In reference to the examples given for fully patched browsers and operating systems, Windows Vista along with an updated IE 7 is almost unbreakable.

    BTW:
    Look at what SEarcHEngineSWEB just accomplished by posting on this blog. This makes about 20. 😮

    http://blogoscoped.com/forum/114161.html

  6. Matt

    Welcome back from the logistics planet 🙂

    I thought that you would update us about a possible backlink and PageRank update which started around the 5th October 2007 😉

  7. Sorry

    Correction: “started around 5th November 2007” instead of “started around the 5th October 2007” .

  8. Wish I had read this a week or so ago! Seems between the malware issues and MSN dumping links [making all my clients call and ask about a PageRank drop] I have no time to do the things I would like to do…

    Like come look for some more minty fresh humor… thanks Matt. I think Google is not announcing an update because there was not an update. Just a lot of MSN links went missing!! causing PageRank to drop by 2 points for most that were banned by MSN for linking strategies.

    If this were the case…. Could / Would / Should you also ignore MSN links so the dumb slider goes back up?

    thanks,
    seharness

  9. Matt,

    Thanks for the post, I was starting to worry about you, you hadn’t posted in a while. Remember when you first started blogging, and you were so excited about it, and you posted all the time, no matter what it was about, and we all loved it, and saw a human face in Google, and it was totally awesome?

    What’s changed?

    SEO aside, when you don’t post for like 10 days, we get withdrawals; you used to be the first place to check when I saw anything change in the serps. Granted this whole industry is growing, and you don’t strike me as the kind of guy who likes all this attention, so I understand. But, throw us a bone, a movie review here and there, give us your thoughts on some great viral you’ve seen. I’m just speaking for myself but, I think some would agree, you’re an interesting guy, and you have a lot more to offer than just SEO, and Google bidness.

    Embrace your inner blogger and go nuts on any old topic again, we’ll read it, and enjoy. Make a video discussing your mastery of cat makeup, something anything.

    BTW great post, another one of those “I feel smarter now thanks to Matt Cutts” posts.

  10. You touched upon one of my pet peeves with web developers: they just don’t test their code as thoroughly as they should. One aspect that everyone should be aware of is directly passing parameters to a MySQL query (or any other program) without properly parsing and validating the input. It’s a very dangerous practice that is more common than most people realize.
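
    For example, the safe version is only one line different (a sketch using Python’s sqlite3; the exact placeholder syntax varies by database driver):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, email TEXT)")

    user_input = "nobody' OR '1'='1"  # the kind of hostile junk a fuzz test (or attacker) sends

    # Dangerous: the input is pasted straight into the SQL text.
    # rows = conn.execute("SELECT * FROM users WHERE name = '%s'" % user_input)

    # Safer: the driver binds the value as a parameter, never as SQL.
    rows = conn.execute("SELECT * FROM users WHERE name = ?", (user_input,))
    print(rows.fetchall())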

  11. The Robustness Principle: “Be liberal in what you accept, and conservative in what you send”.

    So, just because Google will attempt to parse badly formatted junk, that doesn’t mean that you should deliberately make the site using badly-formed code.

    You should get your site to a state where it is guaranteed to be 100% correctly parsed and indexed. Don’t be that one document that they can’t access.

  12. Ahh… It reminds me of the good old days, when you could still hire monkeys to work for you.

    Give me 10 good strong African chimps fired up on bananas and caffeine, with cable Internet connections, and I could take over the world. Just imagine the quality content, and never, never, never have I ever seen monkeys or chimps create duplicate content… (except of course for that one time with the bible and the complete works of William Shakespeare.)

  13. Dr. David Klien wrote:
    Give me 10 good strong African chimps fired up on bananas and caffeine, with cable Internet connections and I could take over the world.

    well, here you are!!!
    http://www.newtechusa.com/ppi/faq.asp

  14. A very good post Matt, but regarding your list of things to take away from this post, you mention:

    If you’re an internet user, make sure you surf with a fully-patched operating system and browser. You can decrease your risk of infection by using products off the beaten path, such as MacOS, Linux, or Firefox.

    Surely this is akin to security by obscurity, which is a very bad practice. Instead of trying to use obscure programs, people should learn to keep their software updated. Although the packages you mentioned are good, that does not mean they are safe. Firefox, for example, has many bugs (some articles even claim more than Internet Explorer, although that is probably Firefox being more open about its reported bugs).

    As the packages you mention gain a stronger market share, people will start to gear their efforts towards them more, and if users simply switched so they could get away from exploits, then they will be in for a shock, as they have not gotten themselves into the mentality of security.

  15. A question for Matt. Take a site accessible at:

    – example.org/
    – http://www.example.org/
    – example.otherdomain.com:30080/

    All three versions are indexed; thousands of pages for each.

    The canonical domain is supposed to be: http://www.example.org

    Site-wide redirect (301) set up from example.org to http://www.example.org a few months ago.

    The example.org URLs are gradually dropping out of the index.

    The number of http://www.example.org URLs has increased too.

    The duplicate site at example.otherdomain.com:30080 cannot be redirected. Note the required port number.

    There is a robots.txt file at:

    example.otherdomain.com:30080/robots.txt (note the required port number)

    The robots.txt file has this in it:

    User-Agent: *
    Disallow: /

    The example.otherdomain.com:30080 site URLs are NOT dropping out of the index.

    I assume that Google is looking at example.otherdomain.com/robots.txt for the robots file.

    That URL does not work. The port number MUST be added to access the robots.txt file.

    How to get Google to see the robots file at: example.otherdomain.com:30080/robots.txt ?

    Will Google automatically try using the same port number on the robots.txt URL if it finds a port number on any of the normal page URLs?

    How to get those duplicate URLs de-indexed?
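
    (For what it’s worth, if anything at all could be changed on that port, even a tiny redirect service would settle the duplicate issue. A rough Python sketch, using the hostnames above:)

    # Rough sketch: answer every request on the odd port with a 301 to the canonical host.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    CANONICAL = "http://www.example.org"  # the preferred domain in this example

    class RedirectHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(301)
            self.send_header("Location", CANONICAL + self.path)
            self.end_headers()

        do_HEAD = do_GET  # treat HEAD requests the same way

    if __name__ == "__main__":
        HTTPServer(("", 30080), RedirectHandler).serve_forever()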

  16. I just like the words “fuzz testing”…lol

  17. I think Brandon (above) said it best. Heh.

    I also think that fuzz testing, just like life, is best done in moderation, and that you can’t plan for everything.

    Like Nick said above, it must be nice to have the entire index at your disposal. He’s got a point, even if he can’t spell your name properly. Given that it’s IN THE URL he probably got it right at some point eh?

  18. g1smd, well said on the Robustness Principle. On your other question about port numbers: the ideal way would be to redirect the urls at the unusual port number. However, I believe that our new url removal tool can handle port numbers. I know the old url removal system couldn’t handle non-standard port numbers, which was a pain point for some users because those requests would need to get sent to user support and we’d have to tackle them on our side. So my guess is that the new url removal system specifically tried to tackle that weakness in the second iteration.

    bob rains, I’ll try to post a little more this week. My inlaws were in town since about Wednesday and we went into the city (San Francisco) for a few days; I prefer not to post about trips until after I get back, just so people don’t say “Hey, Matt’s out of town. I’ll stop by Matt’s house and take Matt’s cats for a joyride.” 🙂

    But I take your point bob rains, and it’s a good one. One small part of me feels like I’ve covered many of the basic topics that I wanted to cover at least once on my blog (“Question: Does advertising on Google help your search rankings? Answer: No.”). Part of why I wanted to start a blog was to have a permalink that I could point to where I could write the answer to a question once and then point people to that link from then on.

    Frankly, there’s also a part of me that’s been holding back on blogging a little bit because 1-2 people have been throwing negative accusations my way or Google’s way, and I’m thinking about the best way to respond. Suffice it to say that there’s always at least two sides to any story, and I’m pondering how best to say some of that stuff. I think that’s solvable though.

    The last thing is that in 2005 there was a real need for someone from a search engine to be blogging, and we didn’t really have an official webmaster blog or any resource like that. We do have an official blog now, and if you review the recent posts there, they’ve been really information-rich. That’s not even mentioning the webmaster console, plus the Google discussion group.

    So I am pondering for my blog what the best role would be. Around this time last year we had the SES in San Jose with a few people enjoying calling themselves Cuttlettes, e.g. http://www.lyndseo.com/blog/?m=200608 . On one hand, that’s incredibly flattering (Hi Lyndsay!). But I also started doing some thinking about how it’s not healthy if people only identify me with Google’s webmaster relations and don’t give credit to all the other people that work really hard at Google to tackle spam or to communicate with webmasters or to craft tools for webmasters.

    Sometimes you see that “attribute things to Matt” idea taken to an extreme, e.g. http://sphinn.com/story/4415 where someone makes the claim “Rand Fishkin (who condones link buying) does not like directories so he has been complaining to his buddy Matt Cutts and Matt has gone out and manually penalized a large number of the leading directories.” I think that ascribing some assumption about Google’s actions to me is out of proportion with my actual impact at Google.

    Anyway, since the beginning of this year, I’ve been making a conscious effort to get the spotlight off of me a little bit and onto some of the other great people at Google as they take on more communication. I think if you go and check out not only the quantity but the variety of different blog authors on the official Google webmaster blog, it’s amazing. In many ways, Google has got lots of different voices doing more communication.

    There’s Maile doing privacy videos, Susan and others in the Google webmaster groups, Adam just got back from SMX Stockholm and immediately organized a WordPress meeting at the Googleplex, we’ve had folks go to SEO meet-ups in India, there’s a China webmaster blog, folks have been doing webmaster blog posts in Polish and tons of other languages. It’s just wild how many different people are contributing. And I think for every person you know about by name, there’s 1-2 more people who look at feedback from across the web and respond to that.

    Google is going to exit 2007 with so many more people working on webmaster communication and tools than at the beginning of 2007. I’m hugely grateful for all those people, and I’ve been consciously trying to give some of those people more of a chance to become known. Anyway, it’s late, but I hope that gives a little bit of explanation why I’ve been shying away from doing a ton of search posts lately (in addition to this big non-webspam project I had).

    I always get a little pensive in the fall. It’s a good season to ask questions like “What is my favorite part of webmaster communication?” (Maybe videos, to tell the truth.) Or to ask “Should I do more basic SEO posts, or tackle some really obscure, how-many-angels-could-dance-on-the-head-of-a-pin type question that only a few SEO experts will care about?” I’ve also been asking myself about the right balance of 1:1 communications (e.g. email) vs. one-to-many communications. I get a lot of questions by email and I’d like to respond to most of them, but that takes away some time that I could use to answer questions in a more public way.

    Like I said, I’m just pensive lately as the seasons change. 🙂 If folks have suggestions for how to spend my time (videos vs. email vs. blog posts vs. webmaster group comments vs. spending time with my family), let me know your thoughts.

  19. Glad to see you back Matt

    heh, 1990! Some of us were doing this sort of testing back in the 80s. I did something similar on some R&D projects that the BSI was doing on ISO 9000 (the program was an oil seal program running on a PET).

    I also managed to crash an A17 (Unisys’s biggest mainframe at the time, late 80s) doing similar testing 🙂 let’s see what happens if I enter a maximal number here.

  20. Matt Cutts, is there any external website that you can suggest which scans a website and finds the syntax errors for it?

  21. I see “syntax error” on many pages while surfing the net, but only for a limited time. I mean, when I visit the same site after some time, the same page runs without a syntax error, so does that mean a syntax error stays for a limited time?

  22. Uhm, yeah, this comments system doesn’t allow me to talk about HTML freely, so the above syntax examples are messed now 😀

  23. SEarcHEngineSWEB for president

  24. Matt,

    “My inlaws were in town since about Wednesday and we went into the city (San Francisco) for a few days; I prefer not to post about trips until after I get back, just so people don’t say “Hey, Matt’s out of town. I’ll stop by Matt’s house and take Matt’s cats for a joyride.””

    Are you telling us that you leave Emmy at Oz’s mercy for days 🙂

  25. SEarcHEngineSWEB for president

    Of what? Nigeria? I hope so. According to those 419 emails, Nigerians seem to go through political figures about every 5 minutes.

    Phillipp, < and > are wonderful things. 😉

    Matt, if you’re looking for things to write about, how about getting more in-depth on Google departments you’re not normally part of? Interview an Adsense employee or something like that.

  26. Harith, when Emmy gets that right hook going, Oz clears out of the room. They both can defend themselves. 🙂

    Philipp Lenssen, I completely agree that things like open hrefs and dangling tables that are never closed and other syntax weirdness can affect a search engine’s rankings of a page. But my point is that Google typically wouldn’t drop a page or deliberately rank it lower just because of a small syntactic error or because it didn’t validate. It may accidentally rank lower (e.g. if you never closed a comment, so most of a page turned out to be in a comment accidentally, then that text wouldn’t help rankings), but Google typically wouldn’t penalize for weird/invalid syntax.

    And yes, we do look at how different browsers interpret a page, but we try to index a page intelligently for our ranking purposes. For example, a few years ago people would try to rank higher by using multiple TITLE tags. One common browser would show the last title, so you could do a ton of keyword stuffing in TITLE tags and yet have it look pretty to the user. We noticed that and took steps to interpret multiple title tags in a way that worked well for our users. So sometimes people make mistakes in their HTML, and sometimes people deliberately mess up their HTML, but we try to find ways to rank pages reasonably.
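
    The general defense is pretty simple, by the way. Here’s a toy sketch (not our actual code): decide up front which occurrence of a duplicated tag you’ll honor, and ignore the rest.

    # Toy example: keep only the first TITLE's text and ignore any later TITLE tags.
    from html.parser import HTMLParser

    class FirstTitleOnly(HTMLParser):
        def __init__(self):
            super().__init__()
            self._in_title = False
            self.title = None

        def handle_starttag(self, tag, attrs):
            if tag == "title" and self.title is None:
                self._in_title = True

        def handle_data(self, data):
            if self._in_title:
                self.title = data.strip()
                self._in_title = False

    stuffed = "<title>Cheap Widgets</title><title>widgets widgets widgets widgets</title>"
    parser = FirstTitleOnly()
    parser.feed(stuffed)
    print(parser.title)  # prints "Cheap Widgets", not the keyword-stuffed second title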

  27. Feel free to elaborate on this comment: “By the way, that’s why we don’t penalize sites if they have syntax errors or don’t validate — sometimes the best document isn’t well-formed or has a syntax error.”

    But I think it would be better if the elaboration were shared on an official Google blog.

  28. This makes me think of writing image processing programs in school and how we were told that a robust JPEG processor would take years to complete just because of all those special cases and small errors. Well, JPEG in particular because there is some insane math behind the compression, but the same holds true for any file parsing program. There’s just so much that can go wrong.

  29. Thanks for the comments Matt. So, it looks like the meta robots disallow tag alone or the robots.txt disallow + Google removal tool are the only ways.

    Hmmm. Is that removal tool still a 90-day or 180-day thing, or is it more permanent these days?

  30. Off topic, but our server was hit by an aggressive crawler the other day from a Seattle ISP, in violation of the global-deny-with-exceptions in our robots.txt. We also prohibit any access from colocation and hosted ISPs in our User Agreement.

    Upon calling the ISP, I was told that the crawl was from a third party company that crawls on contract to Google.

    The User-Agent of the crawler was a generic Mozilla, with no indication that it was anything but a browser.

    Questions:

    (1) Is Google using third party companies to crawl like this in violation of robots.txt?

    (2) We block the entire ISP range of such ISPs with a 403 when we discover them. Will this hurt us in Google?

  31. Could that be the much rumoured cloaking bot, Stephen?

    It’s difficult to think of a way that Google could detect cloaked content other than by crawling using non-Google user agents from non-Google IPs.

    doc

  32. Welcome back Matt,

    I really thought that Google was sorting out content from the source, but from what you say I understand that I was wrong. Therefore, a proper source code is much more important than one could believe.
    Thanks for pointing out this important factor.

    Cheers,
    Laurent

  33. So is this sort of like the new gmail interface that sends back some random code that crashes my browser every time I reply to something and hit “Archive” right away after that? I had to switch back to the older version to keep my browser running.

  34. g1smd, I think it’s for 180 days.

    Stephen, that sounds strange; why wouldn’t we just get the IP range ourselves? Post the IP address and I’ll poke at it a bit and see if I discover anything.

  35. Matt,

    Few words on what you wrote.

    “Sometimes you see that “attribute things to Matt” idea taken to an extreme, e.g. http://sphinn.com/story/4415 where someone makes the claim “Rand Fishkin (who condones link buying) does not like directories so he has been complaining to his buddy Matt Cutts and Matt has gone out and manually penalized a large number of the leading directories.” I think that ascribing some assumption about Google’s actions to me is out of proportion with my actual impact at Google.”

    You should expect such attacks. But you should always ask yourself; who is attacking me?

    You are fighting spam and spammers. You are preventing, for example, paid link merchants from making easy money. What did/do you expect, Matt?

    Should such attacks and Google bashing campaigns have any negative effects on you and your writing? No of course not.

    Matt, the majority of webmaster/SEO communities admire and respect what you and Adam Lasnik are doing to communicate things to us. That’s what you should focus on. Not those filthy spammers 🙂

  36. 208.99.195.54

  37. I of course agree that it’s important to write good code following the standards. But I think it shouldn’t come at a cost to usability; some browsers don’t follow the standards and thereby sometimes need some kind of hack.

    Standards and valid XHTML are good, but like Google says: we’re making sites for visitors, not for browsers and search engines…

  38. Only 40% had errors in ’96?

    I would be willing to bet that fewer than 1/4 of that now validate. :-()

    My fav fuzz testing was taking a stack of punch cards for a Harris 1000 mainframe, tossing them in the air, re-stacking them, and then dumping them in the card reader. Never got the mainframe to latch up but sure got some nice error printouts.

    A sys admin had to find something to do between programming ASCII pr0n and writing programs that would cause an off-tuned radio placed on top of the core to play recognizable songs, otherwise life got tedious.

  39. I had always believed that I was wasting my time by validating all my HTML and CSS code (100%)… and with this post, I confirm my belief… at least up to now.

  40. Hi Matt.

    I wondered if you could help with something?

    My website has been up and running in Google for around a year and a half.

    I recently made some new pages on the domain that would target certain things we would sell in the future. These pages contained a title, description, keywords tag, an h1 tag and pretty much nothing else. I started to gather links to these pages mainly through articles I was writing. They were all indexed in Google and did quite well (for blank pages anyway) for around three months.

    Around two weeks ago all these pages were removed from Google’s index and their toolbar PR greyed out.

    The domain has around 2000 pages indexed and has no problem with these whatsoever. It’s just the pages that were blank.

    The pages do not appear in Google at all any more. The pages are not in the supplemental index and Google appears to ignore them.

    I have now added links to these pages from the main domain and have added quality unique content consisting of around 500 words.

    Any advice on this would be greatly appreciated.

    Thank you.

  41. Point of post = use Mac OSX

    Well, that’s what I got out of it, anyway.

  42. Matt, I wanted to read the archives of your blog, but your next and previous entry links are misplaced. Previous should be on the left and Next should be on the right (or full right).

  43. “You are fighting spam and spammers. You are preventing for example paid links merchants of making easy money. What did/do you expect Matt?”

    Harith, I take your point, and my skin is pretty thick. The only thing I worry about is if the folks making negative claims start to convince other people (who probably don’t know the whole story). So the decision about how and when to rebut claims has been on my mind lately.

  44. In the pre-web days, working for a boxed software company, we called arbitrary-input testing the ‘butt test’ — the IDEA of sitting on the keyboard and/or using other body parts to provide unexpected input.

    ‘Butt test’ was a euphemism, used when talking about testing, or at least we thought it was, until we found some summer interns doing some literal butt-testing…

  45. I can’t express how useful these tips are! This may explain several “unexplained” cases of apparently clean web pages flagged red by Google or other phishing protection systems that I stumbled upon during the last year or so. Thank you!

  46. Writing valid HTML is the holy grail, but few reach it. So it’s a good thing that Google still returns “not valid” pages, because otherwise we’d miss the relevant results in most of our searches. If Google only returned valid pages, you’d never find Matt Cutts’ blog: W3C’s validation tool gave me 20 errors for its main page.
    In my own opinion, as a web developer, it’s preferable to have a mostly valid page than to waste lots of precious energy trying to please the W3C validator.

  47. The only part of our pages that won’t validate are the affiliate ads (which we can’t do much about) and (ahem) the Google Search code, which assumes a page is written with XHTML 1, which only LSD-chomping hippies use these days (HTML 4 is the leading edge, waiting on HTML 5).

  48. I just like the words “fuzz testing”
