Traveling for the next few weeks

I’m going to be traveling for the next few weeks. I’ll be at three different conferences:

I’ll probably be much slower to respond to emails and tweets while I’m on the road.

Google launches two-factor authentication

Google just launched two-factor authentication, and I believe everyone with a Google account should enable it.

Two-factor authentication (also known as 2-step verification) relies on something you know (like a password) and something you have (like a cell phone). Crackers have a harder time getting into your account, because even if they figure out your password, they still only have half of what they need. I wrote about two-factor authentication when Google rolled it out for Google Apps users back in September, and I’m a huge fan.
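To make the "something you have" half concrete, here is a minimal sketch of how a phone app can generate a short-lived code from a shared secret. This post doesn't say which algorithm Google uses, so treat this as an illustration of the public HOTP/TOTP standards (RFC 4226 and RFC 6238), not as a description of Google's implementation. The function names and the 30-second step are assumptions for the example.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """Counter-based one-time password in the style of RFC 4226."""
    # Sign the 8-byte big-endian counter with the shared secret.
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte choose an offset.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    # Keep the last `digits` decimal digits, left-padded with zeros.
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, now: Optional[float] = None, step: int = 30,
         digits: int = 6) -> str:
    """Time-based variant (RFC 6238 style): the counter is the time window."""
    if now is None:
        now = time.time()
    return hotp(secret, int(now // step), digits)


if __name__ == "__main__":
    # Example secret taken from the RFC test vectors, not a real account secret.
    print(totp(b"12345678901234567890"))
```

The phone and the server both hold the secret, so both can compute the same code for the current time window. Someone who only knows your password can't, which is the "half of what they need" point above.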

Account hijacking is no joke. Remember the Gawker password incident? If you used the same password on Gawker properties and Gmail, two-factor authentication would provide you with more protection. I’ve had two relatives get their Gmail accounts hijacked when someone guessed their passwords, and I’ve seen plenty of similar incidents where two-factor authentication would have kept hackers out. If someone hacked your Gmail account, think of all the other passwords they could get access to, including your domain name or webhost accounts.

Is it a little bit of extra work? Yes. But two-step verification instantly provides you with a much higher level of protection. I use it on my personal Gmail account, and you should too. Please, protect yourself now and enable two-factor authentication.

Google 2000 vs. Google 2011

I sometimes hear people say “Remember when Google launched and the results were so good? Google didn’t have any spam back then. Man, I wish we could go back to those days.” I know where those people are coming from. I was in grad school in 1999, and I remember that Google’s quality blew me away after just a few searches.

But it’s a misconception that there was no spam on Google back then. Google in 2000 looked great in comparison with other engines at the time, but Google 2011 is much better than Google 2000. I know because back in October 2000 I sent 40,000+ queries to google.com and saved the results as a sort of search time capsule. Take a query like [buy domain name]. Google’s current search results aren’t perfect, but the page returns several good resources as well as some places to actually buy a domain name. Here’s what Google returned for that query in 2000:

URL_1:http://buy-domain-name.domain-searcher.com/domains/buy-domain-name.shtml
URL_2:http://buy-domain-name.domain-searcher.com/buy-domain-name.shtml
URL_3:http://buy-domain.domain-searcher.com/domains/buy-domain.shtml
URL_4:http://buy-domain.domain-searcher.com/Map3.shtml
URL_5:http://domain-name-broker.domain-searcher.com/domains/domain-name-broker.shtml
URL_6:http://users5.50megs.com/buydomain32/
URL_7:http://users4.50megs.com/buydomain02/
URL_8:http://domain-name-service.domain-searcher.com/domains/domain-name-service.shtml
URL_9:http://domain-name-service.domain-searcher.com/Map2.shtml
URL_10:http://dns-id.co.uk/

Seven of the top 10 results all came from one domain, and the urls look a little… well, let’s say fishy. In 1999 and early 2000, search engines would often return 50 results from the same domain in the search results. One nice change that Google introduced in February 2000 was “host crowding,” which only showed two results from each hostname (here’s what a hostname is). Suddenly, Google’s search results were much cleaner and more diverse! It was a really nice win–we even got email fan letters. Unfortunately, just a few months later people were creating multiple subdomains to get around host crowding, as the results above show. Google later added more robust code to prevent that sort of subdomain abuse and to ensure better diversity. That’s why it’s pretty much a wash now when deciding whether to use subdomains vs. subdirectories.
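Here is a rough sketch of the host crowding idea, applied to the ten results listed above. It is not Google's code. The function names are made up for the example, and the "last two labels" grouping is a deliberately crude stand-in for real domain detection, which would need something like a public suffix list.

```python
from collections import Counter
from urllib.parse import urlparse

# The ten results for [buy domain name] from the October 2000 capture above.
RESULTS_2000 = [
    "http://buy-domain-name.domain-searcher.com/domains/buy-domain-name.shtml",
    "http://buy-domain-name.domain-searcher.com/buy-domain-name.shtml",
    "http://buy-domain.domain-searcher.com/domains/buy-domain.shtml",
    "http://buy-domain.domain-searcher.com/Map3.shtml",
    "http://domain-name-broker.domain-searcher.com/domains/domain-name-broker.shtml",
    "http://users5.50megs.com/buydomain32/",
    "http://users4.50megs.com/buydomain02/",
    "http://domain-name-service.domain-searcher.com/domains/domain-name-service.shtml",
    "http://domain-name-service.domain-searcher.com/Map2.shtml",
    "http://dns-id.co.uk/",
]


def hostname(url):
    """Group by full hostname, as in the original host crowding."""
    return urlparse(url).hostname


def naive_domain(url):
    """Group by the last two labels of the hostname (crude assumption)."""
    return ".".join(urlparse(url).hostname.split(".")[-2:])


def crowd(urls, key, limit=2):
    """Keep at most `limit` results per group, preserving rank order."""
    seen = Counter()
    kept = []
    for url in urls:
        group = key(url)
        if seen[group] < limit:
            seen[group] += 1
            kept.append(url)
    return kept


print(len(crowd(RESULTS_2000, hostname)))      # 10: subdomains slip through
print(len(crowd(RESULTS_2000, naive_domain)))  # 5: grouped by domain instead
```

Limiting by hostname keeps all ten results, because each spammy subdomain counts as its own host. Grouping by domain cuts the list to five, though the crude grouping also lumps together unrelated sites on shared hosts like 50megs.com, which is part of why the real fix needed more robust code.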

Improving search quality is a process that never ends. I hope in another 10 years we look back and say “Wow, most queries were only a few words back then. And we had to type queries. How primitive!” Mostly I wanted to make the point that Google looked much cleaner compared to other search engines in 2000, but spam was absolutely an issue even back then. If someone harkens back to the golden, halcyon days when Google had no spam–take those memories with a grain of salt. :)

How to strip JPEG metadata in Ubuntu

If you want to post some JPEG pictures but you’re worried that they might have metadata like location embedded in them, here’s how to strip that data out.

First, install exiftool using this command:

sudo apt-get install libimage-exiftool-perl

Then, go into the directory with the JPEG files. If you want to remove metadata from every file in the directory, use

exiftool -all= *.jpg

The exiftool will make copies, so if you had a file called image.jpg, when you’re done you’ll have image.jpg with all the metadata stripped plus a file called image.jpg_original which will still have the metadata.
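If you want a quick sanity check that a file no longer carries an Exif block, here is a small sketch that walks the JPEG segment headers and looks for an APP1 segment starting with "Exif". It is a simplified parser with made-up function names, and it only checks for Exif; it does not look for other metadata such as XMP or IPTC, so running exiftool on the file is still the more thorough check.

```python
import struct


def has_exif_segment(data: bytes) -> bool:
    """Return True if the JPEG bytes contain an APP1 segment holding Exif data."""
    if data[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        raise ValueError("not a JPEG file")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # not at a marker; give up rather than guess
        marker = data[pos + 1]
        if marker in (0xDA, 0xD9):
            break  # start of image data or end of image: headers are over
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return True
        pos += 2 + length
    return False


def file_has_exif(path: str) -> bool:
    """Convenience wrapper that reads a file from disk."""
    with open(path, "rb") as handle:
        return has_exif_segment(handle.read())
```

After running the command above, file_has_exif("image.jpg") should be False, while file_has_exif("image.jpg_original") should still be True if the original had Exif data.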

My thoughts on this week’s debate

Earlier this week I was on a search panel with Harry Shum of Bing and Rich Skrenta of Blekko (moderated by Vivek Wadhwa), and the video is now live. It’s forty minutes long, but it covers a lot of ground:

One big point of discussion is whether Bing copies Google’s search results. I’m going to try to address this earnestly; if snarky is what you want, Stephen Colbert will oblige you.

First off, let me say that I respect all the people at Bing. From engineers to evangelists, everyone that I’ve met from Microsoft has been thoughtful and sincere, and I truly believe they want to make a great search engine too. I know that they work really hard, and the last thing I would want to do is imply that Bing is purely piggybacking Google. I don’t believe that.

That said, I didn’t expect that Microsoft would deny the claims so strongly. Yusuf Mehdi’s post says “We do not copy results from any of our competitors. Period. Full stop.”

Given the strength of the “We do not copy Google’s results” statements, I think it’s fair to line up screenshots of the results on Google that later showed up on Bing:

Google Screenshot
compared with
Bing Screenshot

and

Google Screenshot
compared with
Bing Screenshot

and

Google Screenshot
compared with
Bing Screenshot

and

Google Screenshot
compared with
Bing Screenshot

and

Google Screenshot
compared with
Bing Screenshot

and

Google Screenshot
compared with
Bing Screenshot

and

Google Screenshot
compared with
Bing Screenshot

I think if you asked a regular person about these screenshots, Microsoft’s “We do not copy Google’s results” statement wouldn’t ring completely true.

Something I’ve heard smart people say is that this could be due to generalized clickstream processing rather than code that targets Google specifically. I’d love if Microsoft would clarify that, but at least one example has surfaced in which Microsoft was targeting Google’s urls specifically. The paper is titled Learning Phrase-Based Spelling Error Models from Clickthrough Data and here’s some of the relevant parts:

The clickthrough data of the second type consists of a set of query reformulation sessions extracted from 3 months of log files from a commercial Web browser [I assume this is Internet Explorer. --Matt] …. In our experiments, we “reverse-engineer” the parameters from the URLs of these [query formulation] sessions, and deduce how each search engine encodes both a query and the fact that a user arrived at a URL by clicking on the spelling suggestion of the query – an important indication that the spelling suggestion is desired. From these three months of query reformulation sessions, we extracted about 3 million query-correction pairs.

This paper very much sounds like Microsoft reverse engineered which specific url parameters on Google corresponded to a spelling correction. Figure 1 of that paper looks like Microsoft used specific Google url parameters such as “&spell=1” to extract spell corrections from Google. Targeting Google deliberately is quite different than using lots of clicks from different places. This is at least one concrete example of Microsoft taking browser data and using it to mine data deliberately and specifically from Google (in this case, the efforts of Google’s spell correction team).
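To show what targeting a url parameter means in practice, here is a hypothetical sketch of pulling the corrected query out of a logged url that carries spell=1. This is not the paper's code. The function name, the example.com urls, and the assumption that the query lives in a q parameter are all mine, for illustration only.

```python
from urllib.parse import parse_qs, urlparse


def spelling_click(url):
    """Return the query text if the url is marked spell=1, else None."""
    params = parse_qs(urlparse(url).query)
    if params.get("spell") == ["1"]:
        # Assumes the search query is carried in a parameter named "q".
        return params.get("q", [None])[0]
    return None


# Hypothetical log lines; example.com stands in for a search engine.
print(spelling_click("http://www.example.com/search?q=britney+spears&spell=1"))
print(spelling_click("http://www.example.com/search?q=britny+spears"))
```

The point is that a check like this only works if someone has studied how one particular search engine encodes its urls, which is what separates it from counting clicks in general.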

That brings me to an issue that I raised with Bing during the search panel and afterwards with Harry Shum: disclosure. A while ago, my copy of Windows XP was auto-updated to IE8. Here’s one of the dialog boxes:

IE8 suggested sites

I don’t think an average consumer realizes that if they say “yes, show me suggested sites” they’re granting Microsoft permission to send their queries and clicks on Google to Microsoft, which will then be used in Bing’s ranking. I think my Mom would be confused that saying “Yes” to that dialog will send what she searches for on Google and what she clicks on to Microsoft. I don’t think that IE8’s disclosure is clear and conspicuous enough that a reasonable consumer could make an informed choice and know that IE8 will send their Google queries/clicks to Microsoft.

One comment that I’ve heard is that “it’s whiny for Google to complain about this.” I agree that’s a risk, but at the same time I think it’s important to go on the record about this.

Another comment that I’ve heard is that this affects only long-tail queries. As we said in our blog post, the whole reason we ran this test was because we thought this practice was happening for lots and lots of different queries, not simply rare queries. To verify our hypothesis, rare queries were the easiest to test. To me, what the experiment proved was that clicks on Google are being incorporated in Bing’s rankings. Microsoft is the company best able to answer the degree to which clicks on Google figure into Bing’s rankings, and I hope they clarify how much of an impact clicks on Google have on Microsoft’s rankings.

Unfortunately, most of the reply has been along the lines of “this is only one of 1000 signals.” Nate Silver does a good job of tackling this, so I’ll quote him:

Microsoft’s defense boils down to this: Google results are just one of the many ingredients that we use. For two reasons, this argument is not necessarily convincing.

First, not all of the inputs are necessarily equal. It could be, for instance, that the Google results are weighted so heavily that they are as important as the other 999 inputs combined.

And it may also be that an even larger fraction of what creates value for Bing users are Google’s results. Bing might consider hundreds of other variables, but these might produce little overall improvement in the quality of its search, or might actually detract from it. (Microsoft might or might not recognize this, since measuring relevance is tricky: it could be that features that they think are improving the relevance of their results actually aren’t helping very much.)

Second, it is problematic for Microsoft to describe Google results as just one of many “signals and features”. Google results are not any ordinary kind of input; instead, they are more of a finished (albeit ever-evolving) product

Let’s take that thought to its conclusion. If clicks on Google really account for only 1/1000th (or some other trivial fraction) of Microsoft’s relevancy, why not just stop using those clicks and reduce the negative coverage and perception of this? And if Microsoft is unwilling to stop incorporating Google’s clicks in Bing’s rankings, doesn’t that argue that Google’s clicks account for much more than 1/1000th of Bing’s rankings?

I really did try to be calm and constructive in this post, so I apologize if some frustration came through despite that–my feelings on the search panel were definitely not feigned. Since people at Microsoft might not like this post, I want to reiterate that I know the people (especially the engineers) at Bing work incredibly hard to compete with Google, and I have huge respect for that. It’s because of how hard those engineers work that I think Microsoft should stop using clicks on Google in Bing’s rankings. If Bing does better on a search query than Google does, that’s fantastic. But an asterisk that says “we don’t know how much of this win came from Google” does a disservice to everyone. I think Bing’s engineers deserve to know that when they beat Google on a query, it’s due entirely to their hard work. Unless Microsoft changes its practices, there will always be a question mark.

If you want to dive into this topic even deeper, you can watch the full forty-minute video above.
