In a previous post I talked a little bit about 302s. Let’s cover them in more detail. A 302 redirect can be on-domain or off-domain. On-domain is simple and not prone to hijacking, so let’s talk about that first. Suppose you go to www.xbox.com and the site does a 302 redirect to some really long url, or a url with a session ID. (This is what xbox.com used to do a couple of years ago; now you end up at e.g. www.xbox.com/en-US/, but play along with me.) Would you rather see www.xbox.com or www.xbox.com/home/redir/sess?session=23412341234124124231455423633 ? Yeah, I’d rather see just www.xbox.com too. That’s why for on-domain 302 redirects (that is, redirects in which the source page and the destination page are on the same domain), search engines will usually pick the shorter url. Hopefully that makes sense. I’d rather see www.example.com than www.example.com/deep/home/page?last=root&sessid=909345AF2343 , and I think most people would too.
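To make the "pick the shorter url" idea concrete, here is a minimal Python sketch. It is an illustration only, not how any search engine actually implements this: the function names are made up, and "same domain" is approximated by comparing hostnames with a leading "www." stripped, which is a simplifying assumption.

```python
from urllib.parse import urlsplit


def same_host(url_a: str, url_b: str) -> bool:
    """Naive on-domain check (assumption): compare hostnames, ignoring 'www.'."""
    def host(url: str) -> str:
        # urlsplit only recognizes the host when a scheme is present, so add one.
        if "://" not in url:
            url = "http://" + url
        name = (urlsplit(url).hostname or "").lower()
        return name[4:] if name.startswith("www.") else name

    return host(url_a) == host(url_b)


def pick_on_domain_url(source: str, destination: str) -> str:
    """For an on-domain 302, prefer the shorter of the two urls."""
    if not same_host(source, destination):
        raise ValueError("not an on-domain redirect")
    # On a tie, keep the url the user originally asked for.
    return source if len(source) <= len(destination) else destination


if __name__ == "__main__":
    print(pick_on_domain_url(
        "www.xbox.com",
        "www.xbox.com/home/redir/sess?session=23412341234124124231455423633",
    ))
```

With the xbox.com urls from the example above, the sketch picks the short source url.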
Q: Time out. I’ve got a question. What’s the deal with 302 vs. 301? What does that mean? What’s the difference?
A: The “302” refers to the HTTP status code that a web server returns to your browser when you request a page. For example, a 404 page is called a “404” because web servers return a status code of 404 to indicate that a requested page wasn’t found. The difference between a 301 and a 302 is that a 301 status code means that a page has permanently moved to a new location, while a 302 status code means that a page has temporarily moved to a new location. For example, if you try to fetch the page http://example.com/ and the web server says “That’s a 301. The new location is http://www.example.com/” then the web server is saying “That url you requested? It’s moved permanently to the new location I’m giving you.”
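If you want to see these status codes for yourself, here is a small Python sketch using only the standard library. The helper names (`describe_redirect`, `fetch_status`) are made up for illustration. `fetch_status` makes a live request, so what it returns depends on how the server is configured on the day you run it.

```python
import http.client
from http import HTTPStatus


def describe_redirect(status_code: int) -> str:
    """Map a status code onto the cases discussed above."""
    if status_code == HTTPStatus.MOVED_PERMANENTLY:  # 301
        return "permanent"
    if status_code == HTTPStatus.FOUND:              # 302
        return "temporary"
    if status_code == HTTPStatus.NOT_FOUND:          # 404
        return "not found"
    return "other"


def fetch_status(host: str, path: str = "/"):
    """Return (status, Location header) without following any redirect."""
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


if __name__ == "__main__":
    status, location = fetch_status("example.com")
    print(status, describe_redirect(status), location)
```

`http.client` does not follow redirects on its own, which is why it is handy here: you get to see the 301 or 302 and the new location instead of the final page.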
Okay, back to our regular discussion. Now let’s talk about off-domain 302 redirects. By definition, those are redirects from one domain A.com to another domain B.com that are claimed to be temporary; that is, the web server on A.com could always change its mind and start showing content on A.com again. The vast majority of the time that a search engine receives an off-domain 302 redirect, the right thing to do is to crawl/index/return the destination page (in the example we mentioned, it would be B.com). In fact, if you did that 100% of the time, you would never have to worry about “hijacking”; that is, content from B.com returned with an A.com url. Google is moving to a set of heuristics that return the destination page more than 99% of the time. Why not 100% of the time? Most search engines reserve the right to make exceptions when they think the source page will be better for users, even though that happens only rarely.
Let’s take an example from the tiny fraction of the time that we might show the source page for a 302 off-domain redirect. If you run wget on www.sfgiants.com, you’ll get a 302 redirect to a different domain, and the url that you’ll get is pretty ugly: http://sanfrancisco.giants.mlb.com/NASApp/mlb/index.jsp?c_id=sf . Please set aside that you are probably a site owner or webmaster for a second, and try to step into the shoes of a regular user on the street. If we had a taste test, how many users would prefer to click on “sfgiants.com” and how many would prefer to click on “sanfrancisco.giants.mlb.com/NASApp/mlb/index.jsp?c_id=sf” ? Normal users usually like short, clean urls. They are less likely to say “mlb.com? I wonder what that stands for? Hmm. Maybe major league baseball? Is that the officially licensed name, I wonder? It probably is. Yes, it looks like sanfrancisco.giants.mlb.com/NASApp/mlb/index.jsp?c_id=sf is the correct url for my query.”
Now you see the trade-offs. Go with the destination 100% of the time and you’ll get some ugly urls (but never any hijacking). On the other hand, if you sometimes return the source url you can show nicer urls (but with the possibility of source pages showing up when they shouldn’t). Different search engines have different policies that have evolved over time. Over the last year, Google has moved much more toward going with the destination url, for example, and the infrastructure in Bigdaddy continues in this direction.
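Here is a rough Python sketch of that trade-off. To be clear, the exception rule in it is hypothetical: I made it up for illustration (a bare-homepage source beats a destination with a path or query string), and it is not the heuristic that Google or any other engine actually uses. The names `looks_clean` and `choose_off_domain_url` are made up too.

```python
from urllib.parse import urlsplit


def looks_clean(url: str) -> bool:
    """Hypothetical test for a 'nice' url: a bare homepage with no query."""
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to find the host
    parts = urlsplit(url)
    return parts.path in ("", "/") and not parts.query


def choose_off_domain_url(source: str, destination: str,
                          allow_exceptions: bool = False) -> str:
    """Pick which url to show for an off-domain 302 redirect."""
    if not allow_exceptions:
        # Always-destination policy: never any hijacking, sometimes ugly urls.
        return destination
    # Exception policy (made-up rule): nicer urls, but a source page can now
    # show up when it should not, which is the opening for hijacking.
    if looks_clean(source) and not looks_clean(destination):
        return source
    return destination


if __name__ == "__main__":
    src = "www.sfgiants.com"
    dst = "http://sanfrancisco.giants.mlb.com/NASApp/mlb/index.jsp?c_id=sf"
    print(choose_off_domain_url(src, dst))
    print(choose_off_domain_url(src, dst, allow_exceptions=True))
```

With `allow_exceptions` left off, the sketch always returns the destination. Turning it on returns sfgiants.com for this pair of urls, which shows both the benefit (a clean url) and the risk (the rule only looks at url shape, so it cannot tell a legitimate source from a hijacker).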
Let’s take a look at how different engines handle the [sf giants] query. Remember that sfgiants.com does a 302 redirect to a url on a different domain (sanfrancisco.giants.mlb.com/NASApp/mlb/index.jsp?c_id=sf). And remember that reasonable people can disagree on which url should show up at #1. I’m not trying to criticize any search engine here, but rather trying to point out that this is a weird corner case.
Current Google behavior: we return sfgiants.com at #1. But we also return http://sanfrancisco.giants.mlb.com/NASApp/mlb/sf/homepage/sf_homepage.jsp at #3, as an uncrawled url, which is definitely poor/suboptimal.
Current Ask behavior: Ask returns giants.mlb.com/NASApp/mlb/sf/homepage/sf_homepage.jsp at #1, sanfrancisco.giants.mlb.com/NASApp/mlb/index.jsp?c_id=sf at #2, and sanfrancisco.giants.mlb.com/NASApp/mlb/sf/homepage/sf_homepage.jsp at #3.
Current MSN behavior: MSN returns giants.mlb.com/NASApp/mlb/sf/homepage/sf_homepage.jsp at #1 and sanfrancisco.giants.mlb.com/NASApp/mlb/index.jsp?c_id=sf at #2.
Current Yahoo! behavior: Yahoo! returns www.sfgiants.com at #1, but also returns sanfrancisco.giants.mlb.com/NASApp/mlb/index.jsp?c_id=sf at #6. You might think that returning sfgiants.com at #1 isn’t what Yahoo! said that they would do with 302 off-domain redirects (i.e. always go with the destination), but if you read carefully, Yahoo! also reserves the right to make exceptions in handling redirects. That allows them to show a nice url at #1.
Current Google Bigdaddy behavior (data center at 18.104.22.168): Bigdaddy managed to find a short url on the destination domain of mlb.com, namely giants.mlb.com, and returns that. We return it at #1 with no other duplicate urls on the first page.
Please don’t take my listing the current results from each engine the wrong way; I think the results from all the search engines are great for this query, because a user would have gotten to the correct final location with any search engine that they tried. This is also an unusual case where reasonable people can disagree on what the best answer is. Also, I’m positive people can find places where the Bigdaddy data center handles things the wrong way. My only point is that the new infrastructure at the Bigdaddy data center will let us tackle canonicalization, dupes, and redirects in a much better way going forward compared to the current Google infrastructure. I’m not claiming that everything is perfect in Bigdaddy, just that it’s easier for us to make changes and improve search quality as we get feedback from you.
Okay, that’s about all the background I wanted to give. Next post will call for Bigdaddy feedback.