Archives for August 2007

Closing the loop on malware

Suppose you worked at a search engine and someone dropped a high-accuracy way to detect malware on the web into your lap (see this USENIX paper [PDF] for some of the details). Is it better to start protecting users immediately, or to wait until your solution is perfectly polished for both users and site owners? Remember that the longer you delay, the more users potentially visit malware-laden web pages and get infected.

Google chose to protect users first and then quickly iterate to improve things for site owners. I think that’s the right choice, but it’s still a tough question. Google started flagging sites where we detected malware in August of last year. This February, the webmaster console team and Google’s anti-malware team took a big step toward closing the loop for webmasters:
– The webmaster console started listing example URLs with suspected or detected malware.
– Google began attempting to email site owners when we detected malware.

Today, the two Google teams added even more functionality into the webmaster console:

– New: Request a malware review from Google and we’ll evaluate your site.
– New: Check the status of your review.
* If we feel the site is still harmful, we’ll provide an updated list of remaining dangerous URLs.
* If we’ve determined the site to be clean, you can expect removal of malware messages in the near future (usually within 24 hours).

I like that Google will keep updating the list of dangerous URLs for a site, and that they’re working to remove malware warnings even faster once sites clean up. That will help site owners diagnose their problems and get them fixed faster. What’s just as exciting to me is that while I have written about malware unofficially in the past, Google has ramped up official posts about malware on Google’s online security blog.

I’m glad that Google’s anti-malware team has been doing all this stuff to alert site owners if they’re hosting malware. I don’t think it generates any money for Google (if anything, it costs machine resources and engineer cycles to tackle malware), but it does improve the web as malware gets taken down faster. I guess there could be an indirect effect as people trust the web more and maybe surf more, which is good for everybody.

Whitehat SEO tips for bloggers

Okay, I’ve got a bunch of pointers to summarize my WordCamp 2007 talk.

First off, here’s the PowerPoint deck that I presented. Google’s PR team was kind enough to verify that it was okay to release. I made the slides from scratch (not even a Google template), so there shouldn’t be any problems with notes in the slides or other metadata. Also note that I made this entire presentation the day of the conference, so let me know if there are unclear parts.

“But Matt, some of that talk is just bullet points! Where’s the context?” you might comment. Ah, I’m glad you mentioned that. John Pozadzides attended WordCamp and taped the sessions, and he recently put up a video of my talk.

“But Matt, I don’t have an hour to spare to watch the video!” you might comment. Ah, I’m glad that you mentioned that. David Klein was at WordCamp, and he transcribed the talk into text form.

“But Matt, that transcript has a lot of words. It could take me 20-30 minutes to read all that!” you might comment. Well, I’ve already pointed to Stephanie Booth’s write-up of the session. You could also read the summary that Lisa Barone wrote. Or check out Stephan Spencer’s coverage for CNET.

Now you understand why I blogged about Alex Chiu a while ago; I used him as an example in my talk, so I wanted to explain what those two URLs in my PowerPoint meant.

If you read Stephan Spencer’s write-up, he mentions that some people thought underscores are now the same as dashes to Google, but I didn’t quite say that in the talk. I said that we had someone looking at it now, so I wouldn’t consider it a completely done deal at this point. But note that I also said that if you’d already built your site with underscores, it probably wasn’t worth trying to migrate all your URLs over to dashes. If you’re starting fresh, I’d still pick dashes.
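To make the dashes-over-underscores advice concrete, here’s a minimal Python sketch of generating a dash-separated URL slug from a post title. The `slugify` helper name is my own, not anything from the talk:

```python
import re

def slugify(title):
    """Turn a post title into a lowercase, dash-separated URL slug."""
    slug = title.lower()
    # Replace any run of non-alphanumeric characters (spaces,
    # underscores, punctuation) with a single dash.
    slug = re.sub(r"[^a-z0-9]+", "-", slug)
    # Trim any leading/trailing dashes left over from punctuation.
    return slug.strip("-")

print(slugify("Whitehat SEO Tips for Bloggers!"))
# whitehat-seo-tips-for-bloggers
```

Most blog software (WordPress included) already does something like this for you; the point is just that dash-separated slugs are a safe default when you’re starting a new site.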

I also wanted to point out something I’m pretty proud of. If you were at the site review session at Pubcon last year in Vegas, you might remember that there was a chiropractor who wanted to do well for the query [san diego chiropractor]. At the time, Danny Sullivan teased him a bit and said “Well, you might want to put the words ‘San Diego Chiropractor’ together on the page that you want to rank.”

Well, Danny, that site owner was David Klein, and he took all the PubCon advice from the panel to heart. He started a blog, tweaked the copy on his site, and has even started to learn great linkbaiting techniques. For one thing, he transcribed the video of my talk, which took some effort on his part but produced a useful resource. Even better, he came to WordCamp with a creative idea, a pad of paper, and a digital camera. As he met folks at WordCamp, he had each person write their name, their website, and something that they wanted to do. Then he created an original cartoon of that person doing that thing. Go to the post with Matt Mullenweg and click on the picture of Matt to see what I mean. Matt said he wanted to be a writer, so David posted a cartoon of Matt as a writer.

How is this smart? People love to talk about themselves, and love to see themselves in the spotlight. So these little cartoons are natural linkbait: “Hey look, he drew me as a Photoshop plug-in developer!” How much did it cost to do this particular idea? Practically nothing: just the initial creative brainstorming and a little bit of elbow grease.

It was neat to see a regular site owner go from not knowing much about SEO in November 2006 to really improving his traffic with some creativity and straightforward changes. A good SEO can tune up your web site. But if someone is willing to take the time to study SEO, look for fresh ideas, and put in some effort, a regular person can definitely improve their website (and rankings!) as well. To see that come true with a chiropractor that several of us gave feedback to just last year was really exciting. That’s one of the big things that has stayed with me from WordCamp.

Update: Clarifying that Stephan’s write-up didn’t say that dashes and underscores were the same. Thanks, Stephan!

Minty Fresh Indexing

When I joined Google in early 2000, we had a stretch where we didn’t update our index for 3-4 months or more. At the time, that wasn’t bad for a search engine; I remember one search engine around then that wasn’t updated for over a year. Starting in mid-2000, Google updated our index pretty much every month. People used to use the phase of the moon to predict the timing of the next “Google Dance.” 🙂

Now raise your hand if you remember “Update Fritz” from summer 2003. That was the Google Dance where Google switched from a monthly batch update to an incremental update. That means that our crawl/indexing team updated a fraction of our index daily or near-daily. Back then we had not only the normal crawl but also a “fresh crawl,” and if documents were in the fresh crawl then Google would sometimes show a date in our snippet.
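The batch-versus-incremental distinction can be illustrated with a toy inverted index. This is my own simplified sketch for illustration, not Google’s actual design: a batch system rebuilds the whole index from the full document set, while an incremental system folds in just the new or changed documents.

```python
from collections import defaultdict

def build_index(docs):
    """Batch build: rebuild the entire inverted index from scratch."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def update_index(index, doc_id, text):
    """Incremental update: fold one new document into an existing
    index without touching any other document's postings."""
    for word in text.lower().split():
        index[word].add(doc_id)

docs = {1: "google dance monthly update", 2: "fresh crawl daily"}
index = build_index(docs)

# A new post shows up: only this one document gets processed.
update_index(index, 3, "incremental update in minutes")

print(sorted(index["update"]))  # doc IDs containing "update"
```

A real incremental indexer also has to remove a changed document’s old postings and keep everything consistent at massive scale; this sketch only handles additions, but it shows why incremental updates can surface a new page without waiting for a full rebuild.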

The Google crawl/indexing team has continued working hard, and several people have noticed Google’s index getting fresher and fresher. Now some documents can show up in minutes instead of hours or days.

I’ve noticed that as search engines have gotten better (fresher, bigger, more relevant), people keep adjusting their expectations upwards. I can’t imagine waiting over a month for search engines to update their index with news events any more, but just a few years ago that’s how things worked. And it only takes a few encounters with a fresh index until you ratchet up your expectations. My previous mental model was “normally it takes a day or so to show up in many search engines,” but I had my own “Zoiks! That’s fast!” experience tonight, which I’ll describe for you.

I was feed-grazing in Google Reader, as I am wont to do, when I saw a message that there was an update to Reader’s code for offline reading (Google Gears). In my experience, if I move on to the next feed, I lose that little message with the link to update the code (not sure why, but that’s a different post). So I click the link and update my code for offline reading.

In the process, I lost the post that I was currently reading, which was Rich Skrenta’s post about Persai. I wasn’t done reading the post, so what do I do? I go to Google and search for [skrenta blog] so that I can find Skrenta’s blog and finish reading the post.

And what did I see in my search results? The snippet from Skrenta’s blog was showing the post that he did at 7:54 p.m. Pacific time. It was about 8:44 p.m. Pacific time when I did the search. So from Rich hitting the “Post” button to me being able to see it in Google’s main search index was well under an hour.

Don’t believe me? Here’s the bottom of Rich’s post, showing that it went live at 7:54 p.m.:

Post went live at 7:54 p.m.

I double-checked that Rich’s blog was on Pacific time by leaving a quick comment on one of Rich’s other posts.

And here’s Google showing a snippet from Rich’s post within an hour after it went live:

Updated post in Google within an hour

Now that’s a minty fresh index. 🙂 It takes a lot of good design and infrastructure to be able to refresh large numbers of pages that fast. Congrats to the Googlers who are improving Google’s ability to re-crawl, index, and score web documents quickly.

Update: I was only checking every 10 minutes or so, but this post was crawled/indexed/searchable in half an hour or less:

My post was picked up quickly!


Sorry for the lack o’ blogging for the last few days. Here’s what I’ve been up to.

– Early last week, I was at an all-day offsite with my team. It was fun, but it meant I had to catch up on work/email.
– Later in the week, in-laws came into town to visit for the next week or so.
– This morning, I tweaked my back pretty badly. I spent most of today lying in bed and I’m going to stay home tomorrow.

I’m hoping to get a chance to blog more this week, but we’ll see.