I’m still way behind on my email and blog reading, but I’ll go ahead and mention two or three things that I’ve come up to speed on.
I was happy to see that by the time I heard of some issues, they were already resolved. On June 16th, Matt Mullenweg posted that he’d been banned from Google. Happily, Matt kept updating the post as he learned more. It turns out that someone had uncovered Matt’s password by scouring the source code for a new project Matt was working on. The bad guy flipped on a privacy setting on Matt’s blog that added a “noindex” meta tag to his pages, and that tag tells search engines not to include those pages in their index. When Matt figured this out, he removed the noindex tag and he’s back in Google now. In general, if your server is down for a few days and Googlebot can’t crawl your pages, those pages can drop out of our index. But once the pages are reachable again, Google will often find them quickly and you should usually return to where you were before.
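For reference, the noindex directive is just an ordinary meta tag in a page’s head section; a blog privacy setting like the one that bit Matt typically emits it on every page:

```
<head>
  <!-- Tells compliant crawlers not to include this page in their index -->
  <meta name="robots" content="noindex">
</head>
```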
Ruslan Abuzant noticed what looked like a fragment of a server status page. He posted it over at Digital Point Forums, and people there debated whether the fragment was real. Yes, it was real. No, I’m not going to comment on what any of it means. 🙂 Folks have taken steps to keep it from happening in the future, but personally, I think we need to start including some extra settings just for fun. I’d say that we should add
–initial_time_travel_wormhole=”Wednesday, December 31 1969 11:11 pm”
Let’s see, what else. When I saw the obligatory “Google found data we didn’t want indexed” article that I missed while I was gone, I almost didn’t bother to ask around. Barry covered this story pretty well when he noted that Googlebot doesn’t go around guessing passwords. I assumed that someone had left the information lying out somehow, or that there was a hyperlink out on the web with a username and password embedded in it. When I had a chance to talk to a colleague back at Google, though, I got a little more info. He said he didn’t mind if I reprinted what he found:
The URL was on a server that this school district thought was password-protected. Before they took down the server, I was able to retrieve the live URL. I was getting a username/password login page with a regular Firefox user agent, but I got a server error when I changed the UA to Googlebot. I changed back to Firefox and was able to retrieve the username/password page again. It seems their document system was cloaking to Gbot, likely unbeknownst to the people who are writing us now to request the removal.
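My colleague’s test is easy to reproduce yourself. Here’s a minimal Python sketch: it spins up a toy local server that cloaks the way the school district’s document system apparently did (the real URL is long gone, so this server and its responses are hypothetical stand-ins), then fetches the same page with two different User-Agent strings:

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class CloakingHandler(BaseHTTPRequestHandler):
    """Toy server mimicking the behavior described above: browsers get
    the login page, anything claiming to be Googlebot gets a 500."""
    def do_GET(self):
        if "Googlebot" in self.headers.get("User-Agent", ""):
            self.send_response(500)   # server error for the crawler
            self.end_headers()
        else:
            self.send_response(200)   # login page for regular browsers
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<html>username/password login page</html>")

    def log_message(self, *args):
        pass  # keep the demo quiet

def fetch_status(url, user_agent):
    """Return the HTTP status code seen when fetching url with the given UA."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

# Start the toy server on a random free local port.
server = HTTPServer(("127.0.0.1", 0), CloakingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/docs/"

firefox = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"
googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

status_firefox = fetch_status(url, firefox)
status_googlebot = fetch_status(url, googlebot)
server.shutdown()
print(status_firefox, status_googlebot)  # 200 500
```

Same page, two user agents, two completely different responses. That’s exactly the kind of cloaking a site owner can have running without ever realizing it.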
That’s my best guess for how the information got into Google. Of course it’s a moot point now because the URLs are no longer in Google. But that’s what prompted me to write a short/sweet “How to herd Googlebot” post. If you administer a web server that has information you don’t want to be public, it’s easier to exclude content in advance than to try to remove it from search engines later.
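The simplest way to exclude content in advance is a robots.txt file at the root of your server. For example, to keep well-behaved crawlers out of a hypothetical /docs/ directory:

```
User-agent: *
Disallow: /docs/
```

Just bear in mind that robots.txt only keeps polite crawlers out; genuinely sensitive content still needs real password protection on the server side.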
Again, I’m still catching up, but I’m planning to discuss at least a couple more things that I missed while I was on vacation.