One of my favorite computer science papers is a 1990 paper titled “An Empirical Study of the Reliability of UNIX Utilities”. The authors discovered that if they piped random junk into UNIX command-line programs, a remarkable number of them crashed. Why? The random input triggered bugs, some of which had probably lain hidden for years. Up to a third of the programs they tried crashed.
That paper helped popularize fuzz testing, which tests programs by giving random gibberish as input. Some people call this a monkey test, as in “Pound on the keyboard like a caffeine-crazed monkey for a few minutes and see if the program crashes.”
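The core idea is simple enough to sketch in a few lines. Here's a minimal, hypothetical fuzzer in Python: it throws random printable junk at a target function and records any input that makes it blow up (the `target` and the trial count are illustrative assumptions, not part of any particular tool):

```python
import random
import string

def fuzz(target, trials=1000, max_len=100):
    """Feed random strings to `target` and collect inputs that raise."""
    failures = []
    for _ in range(trials):
        # Build a random string of random length -- the "caffeine-crazed monkey".
        junk = "".join(random.choices(string.printable,
                                      k=random.randint(0, max_len)))
        try:
            target(junk)
        except Exception:
            failures.append(junk)
    return failures

# Example: a naive parser that crashes on empty input.
def naive_first_char(s):
    return s[0]  # IndexError when s == ""

crashing_inputs = fuzz(naive_first_char)
```

Real fuzzers (the 1990 study's `fuzz` tool, or modern coverage-guided ones like AFL and libFuzzer) are far more sophisticated, but the principle is the same: random input finds bugs that hand-written test cases miss.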
I can tell you that the web is a fuzz test. If you write a program to process web pages, there are few better workouts than piping a huge number of web pages into it. I’ve seen computer programs that ran with no problem across our entire web index except for *one* document. You would not believe the sort of weird, random, ill-formed stuff that some people put up on the web: everything from tables nested to infinity and beyond, to web documents with a filetype of exe, to executables returned as text documents. In a 1996 paper titled “An Investigation of Documents from the World Wide Web,” Eric Brewer of Inktomi and colleagues discovered that over 40% of web pages had at least one syntax error:
weblint was used to assess the syntactic correctness of a subset of the HTML documents in our data set (approximately 92,000). … Observe that over 40% of the documents in our study contain at least one error.
At a search engine, you have to write your code to process all that randomness and return the best documents. By the way, that’s why we don’t penalize sites if they have syntax errors or don’t validate — sometimes the best document isn’t well-formed or has a syntax error.
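Processing all that randomness in practice means using a lenient parser that degrades gracefully instead of rejecting bad markup. As a sketch, Python's standard-library `html.parser` takes exactly this approach: feed it badly broken HTML (the unclosed, mis-nested snippet below is a made-up example) and it extracts what it can without ever raising:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Pull out text content, shrugging off malformed markup."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

# Unclosed tags, bad nesting, a stray ampersand -- no well-formedness here.
broken = "<html><p>Unclosed <b>tags<table><tr><td>nested & malformed"

parser = TextExtractor()
parser.feed(broken)   # a strict XML parser would reject this outright
text = "".join(parser.chunks)
```

A validating XML parser would throw an error on the first unclosed tag; a search engine that did the same would silently drop 40%+ of the web. Browsers make the same trade-off, which is why error-tolerant parsing was eventually standardized in HTML5.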
But: the web is a fuzz test for you too, gentle reader. As you surf the web, your browser is subjected to an amazing amount of random stuff. Here’s a scary example: a couple of months ago, someone was surfing a website and noticed that an ad on it was serving up malware. I know of a completely different website that apparently got hit by the same incident.
So the take-aways from this post are:
- Fuzz testing is a great way to uncover bugs.
- Lots of great web pages have syntax errors or don’t validate, which is why we still return those pages in Google.
- If you’re an internet user, make sure you surf with a fully-patched operating system and browser. You can decrease your risk of infection by using products off the beaten path, such as MacOS, Linux, or Firefox.
- If you’re a website owner and Google has flagged your site as suspected of serving malware, sometimes it’s because your site served ads with embedded malware. Check if you’ve changed anything recently in how you serve ads. When you think your site is clean, read this post about malware reviews and this malware help topic for more info about getting your site reviewed quickly. Even if your site is in good shape, you might want to review this security checklist post by Nathan Johns.
Update Nov. 15 2007: Fellow Googler Ian Hickson contacted me with more recent numbers from a September 2006 survey he did of several billion pages. Ian found that 78% of pages had at least one error if you ignore the two least critical error types, and 93% if you include those two. There isn’t a published report right now, but Ian has given those numbers out in public e-mail, so he said it was fine to mention the percentages.
These numbers pretty much put the nail in the coffin of the “Only return pages that are strictly correct” argument, because there wouldn’t be that many pages left to work with. That said, designing and writing your HTML so that it’s well-formed and validates is still a good habit.
By the way, if this is the sort of thing that floats your boat, you might want to check out Google’s Code blog, where Ian has posted before.