Recently I’ve been playing with linking to specific parts of a video and incorporating YouTube subtitles. Then I realized that you could do a neat trick. YouTube allows you to create closed captioning with a simple text file that looks like this:
Hi everybody. Welcome back to another video. We’re doing this thing where when we speak at a conference
and we talk about something substantial, not just questions and answers, we talk through our presentation later
and it only takes a little bit of Unix command-line magic to turn that into a file like this:
<a href="http://www.youtube.com/watch?v=Cm9onOGTgeM#t=00m07s">Hi everybody. Welcome back to another video. We’re doing this thing where when we speak at a conference</a>
<a href="http://www.youtube.com/watch?v=Cm9onOGTgeM#t=00m12s">and we talk about something substantial, not just questions and answers, we talk through our presentation later</a>
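The post doesn't show the actual commands, so here is a rough Python sketch of the same idea rather than the original one-liner. It assumes each caption is preceded by a timing line such as `0:00:07.000,0:00:12.000` (the excerpt above shows only the text lines, so that layout is an assumption), and the function name `captions_to_links` is made up for illustration.

```python
import html
import re

# Assumed layout: a "H:MM:SS.mmm,H:MM:SS.mmm" timing line followed by caption text.
TIMESTAMP = re.compile(r"^(\d+):(\d{2}):(\d{2})\.\d+,")

def captions_to_links(caption_text, video_url):
    """Turn timestamped captions into one <a> element per caption line."""
    links = []
    start = None  # (minutes, seconds) of the most recent timing line
    for line in caption_text.splitlines():
        line = line.strip()
        match = TIMESTAMP.match(line)
        if match:
            hours, minutes, seconds = (int(g) for g in match.groups())
            start = (hours * 60 + minutes, seconds)
        elif line and start is not None:
            anchor = "#t=%02dm%02ds" % start
            text = html.escape(line, quote=False)
            links.append('<a href="%s%s">%s</a>' % (video_url, anchor, text))
    return "\n".join(links)
```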
If you run that over your entire caption file, boom, you have a clickable transcript of your video. For the text below, click on any phrase you’re interested in and you’ll be whisked away to YouTube in approximately the right place to hear me say that phrase.
Hi everybody. Welcome back to another video. We’re doing this thing where when we speak at a conference and we talk about something substantial, not just questions and answers, we talk through our presentation later and put it up so people can follow along, watch the slides, and hopefully learn a little bit. So today I wanted to talk about the canonical link element. And that’s something that Google, Yahoo!, and Microsoft all announced that they will support in the future at SMX West. So, the date that we had this announcement was February 12, 2009, and the funny thing about it is that Charles Darwin was born exactly 200 years ago that day.
So I started out with a slide where I made a corny joke and I said, whether you think the web was intelligently designed by Tim Berners-Lee, or whether you think the web needs to evolve, either way this is an open standard which helps people improve the web. And so we sort of said, what is a big problem that faces people today, webmasters, SEOs, site owners on the web? And it’s pretty clear that duplicate content is one of the things that people care about the most. So what is duplicate content? Well, I’ve got a slide here where I show I think eight different URLs, you know every single one of these URLs could return completely different content. In practice, we as humans whenever we look at www.example.com or just regular example.com or /index or home.asp, we think of it as the same page. And in practice, it usually is the same page. So technically it doesn’t have to be, but almost always web servers will return the same content for like these eight different versions of the URL. So, that can cause a lot of problems in search engines if rather than having your backlinks all go to one page, instead it’s split between a www and a non-www version. And it’s a really big headache. How do people solve this?
How do people fix this? Well, it turns out, and I’ll dwell on this slide for just a few minutes, there are a lot of ways to fix it. So, some people have joked that this canonical link element is kind of like, you know, Spackle that fixes over the appearance of all the cracks in the wall. And the fact is there are a lot of ways that you can fix things first and foremost, from the beginning, upstream where you don’t need to fix it downstream later on. There was a really funny quote by Jill Whalen at the conference where she said, “Developers keep SEOs in business.” Right? And so whether you’re a developer or an SEO there are some best practices that can make things a little bit easier for your system so that you don’t have to worry about this issue of duplicate content at all. So, one is to try to make sure that your URLs are standardized, Microsoft sometimes calls them normalized, in essence there’s only one way to get to the content. If your content management system always generates consistent URLs, and they’re completely uniform, and you don’t have to worry about having eight different versions in the first place, that just saves you a lot of trouble. You don’t have to worry about the issue coming up at all. So one way to do that is to fix your content management system or your software so that you only generate these URLs in a very consistent way. Another thing to do is to think about your site. Suppose you have www.example.com and non-www, just plain old example.com. Well if you link to www sometimes and non-www sometimes, it’s natural that search engines might get a little bit confused. So linking consistently, saying okay, my homepage is going to be www.example.com/. Nothing else, that’s it. And then making sure that all of your internal linking is consistent, that alone can make a really big difference, so that you don’t end up with two, three, four copies of each page.
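To make the "only one way to get to the content" idea concrete, here is a minimal sketch of a URL standardizer that a content management system might run before emitting any internal link. The preferred host, the list of index filenames, and the function name `standardize` are all hypothetical choices for this example, not anything prescribed in the talk.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical site policy: prefer the www host and drop default index files.
PREFERRED_HOST = "www.example.com"
INDEX_FILES = {"index.html", "index.htm", "index", "home.asp"}

def standardize(url):
    """Map the common duplicate spellings of a URL onto one consistent form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host == PREFERRED_HOST[len("www."):]:
        host = PREFERRED_HOST  # bare example.com becomes www.example.com
    directory, _, filename = parts.path.rpartition("/")
    if filename.lower() in INDEX_FILES:
        path = directory + "/"  # /tents/index.htm becomes /tents/
    else:
        path = parts.path or "/"  # an empty path becomes /
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))
```

If every link on the site is generated through a function like this, the www and non-www variants never appear in your own pages in the first place.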
If you do have, you know, home.asp or index.html, you can rewrite such that all those other URLs are 301 redirects to a single URL. So, it’s great if you can fix it at the beginning, it’s great if you can link consistently so the issue never comes up, but if duplicate URLs do occur, then you can use a 301, a permanent redirect as we refer to it, to sort of standardize and glom together all of those URLs. And search engines will follow that 301 redirect, and typically group them all together. Google also does a couple of extra things that some search engines don’t do. So, in our Webmaster Tools, our webmaster console, which is totally free, doesn’t cost anything at all, you can specify, for example my site is mattcutts.com, you can specify if you prefer www.mattcutts.com or non-www, so just mattcutts.com. That’s a very easy setting, and that solves a lot of duplicate content issues right there. And a little-known fact, not everybody realizes this, is that whenever you submit your URLs in what we call a Sitemap, which is another standard that’s supported by many major search engines, and it’s a very simple file, it can be as simple as a list of URLs, we take that list of URLs that you submit, and we say to ourselves, oh, if we see a URL in that list, and then we see another version of it that’s not in the list, we will prefer URLs in the list that you gave us. So we sort of use it to break ties whenever you submit URLs from a Sitemap. So there’s at least a couple ways that you can give Google hints that try to help out with duplicate content.
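As an illustration of the 301 approach, the sketch below decides whether a request should be answered directly or permanently redirected to the single preferred URL. The redirect table, the host name, and the function name `respond` are assumptions for the example; in practice this logic usually lives in web server configuration rather than application code.

```python
# Hypothetical table of duplicate paths and the single path each should redirect to.
REDIRECTS = {
    "/index.html": "/",
    "/home.asp": "/",
}
CANONICAL_HOST = "www.example.com"

def respond(host, path):
    """Return (status, location): 301 for a duplicate, 200 for the preferred URL."""
    target_path = REDIRECTS.get(path, path)
    if host.lower() != CANONICAL_HOST or target_path != path:
        # Redirect straight to the final URL in one hop.
        return 301, "http://%s%s" % (CANONICAL_HOST, target_path)
    return 200, None
```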
So, you know, session IDs in general if you can avoid them are great. But sometimes you as the search engine optimizer or the person who is responsible for the site can’t get rid of them entirely. Tracking codes, you know, if you’re buying ads. Analytics, you know the UTM parameter, landing pages where they have to be different landing pages for different ads, those are the sort of things that you sometimes can’t get rid of. And if you run an e-commerce site, suppose you have different products. You might have sort by descending price or sort by ascending price, and sometimes you need to have different facets, different views of your data, and conceptually it’s really the same thing, it’s just a different way to slice and dice it.
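One way to think about these cases is as a function from the "ugly" URL to the clean one. The sketch below strips session, tracking, and sort parameters to compute the URL you would want treated as preferred. The parameter names (`sessionid`, `sort`, anything starting with `utm_`) and the function name are assumptions; substitute whatever your own site uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names that change the view or tracking, not the content.
IGNORED_PARAMS = {"sessionid", "sort"}
IGNORED_PREFIXES = ("utm_",)

def strip_view_params(url):
    """Drop session, tracking, and sort parameters; keep everything else in order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
        and not key.lower().startswith(IGNORED_PREFIXES)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```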
Finally, there’s breadcrumbs. So breadcrumbs are how did I get to this page? Am I coming to this red tent example via tents, or am I coming to it via colors, or did I come to it because I was interested in accessories? How did I land on this page? Even Google’s own webmaster help documentation sometimes has a CTX parameter that says here’s how we got to this page. And that day, it was kind of funny, the Queen had just launched a new website: royal.gov.uk. And so I wish the Queen the best, I want her to live long, and I wish the British monarchy the best, however, someone at the Telegraph, telegraph.co.uk, had done an SEO audit of this site, and they had found duplicate content issues. So you can see right here, just slash, royal.gov.uk/Home.aspx, and then at the very bottom I almost made a ransom note style where I mixed uppercase and lowercase. And the royal website returned the same page for all three of those URLs. So that was just a very simple example to illustrate that anybody can have these sorts of issues.
So what’s the answer? Let’s, you know, I’ve buried the lead enough, how do people solve this particular problem? Well, assuming you can’t solve it any other way, and absolutely I encourage you to try to fix it upstream, to try to link consistently. This is not something that you should just say, oh, now all my problems are solved, I don’t have to worry about anything else. But, if you can’t solve your problems in other ways, there’s a very simple element, link element, where you can say my canonical, and that’s a long word that means you know, my preferred, or the primary, or the clean, the pretty version of the URL that I want to use, is not this ugly URL with a tracking code or a session ID, it’s this pretty URL right over here. And all you have to do is in the head element of this document say you know what, even though this has a weird session ID, the pretty version, the canonical version of this URL, is over here. And that’s literally all it is. It’s a very simple open standard. It’s one simple element that you add to the head of your document. Some interesting little tidbits. This is the director’s cut so you get a little bit of extra info. Is this a tag?
Well, it’s kind of, the technical name I believe is “element.” But we’re all friends here, nobody’s going to abuse you or you know make fun of you if you call it a canonical link tag versus a canonical link element. People often speak about meta tags, right? And so meta tags are things that go in the head of the document as well. And so, if a meta tag has a value that is a hyperlink, I think the most correct thing is not for it to be meta, but for it to be called “link.” And so that’s why you see link rel=”canonical” href= and the value. So now you know the official name, but nobody’s going to care if you just call it the canonical link tag. One thing that’s kind of interesting about this tag, let’s just talk about a few high-order bits.
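For reference, here is a small sketch of how a template might emit that element. The helper names `canonical_link` and `add_to_head` are invented for this example, and the string replacement assumes the page contains a literal `</head>`; a real site would more likely do this in its templating layer.

```python
import html

def canonical_link(canonical_url):
    """Build the link element, escaping characters such as & in the URL."""
    return '<link rel="canonical" href="%s">' % html.escape(canonical_url, quote=True)

def add_to_head(document, canonical_url):
    """Insert the element just before the first </head>; assumes the document has one."""
    return document.replace("</head>", canonical_link(canonical_url) + "\n</head>", 1)
```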
We don’t promise we’re going to abide by this 100%. Right? You know, if we see a webmaster and they’ve accidentally shot themselves in the foot, you know maybe they’ve created an infinite loop, and it’s very easy to create an infinite loop, we reserve the right to do what we think is best. At least at Google, we are going to treat this as a very strong hint. So unless we see some weird corner case or something where you’re probably hurting your own site, we probably would expect to respect this tag. So I think that in most cases, it will work quite well. But we do have to reserve the final, sort of bottom-line ability to say no, we don’t think this is what’s best for the users.
Again, if you can fix it yourself upstream, that’s much better. So look at all the other alternatives, the other choices before you use this tag. Don’t just say, oh, I can just slap everything with a canonical link tag and boom, I’m done. If you’re a regular user, just like a mom-and-pop and you use WordPress or you use some shopping cart software, it’s probably best not to just roll up your sleeves and go digging into it and trying to fix it all yourself, at least not quite yet. Wait a little while, because I think plugins will come out, people are talking about hey, is WordPress able to add this to the core software, so maybe you don’t even need a plugin? So if you’re just a regular user and you wait a few months, things should be fine. You know it’s a brand-new element, so there’s time for you to sit down and cautiously deliberate and say okay, what kinds of duplicate content do I have, how can I fix it? Take a little bit of time. Don’t just jump right in and start, oh I’m going to point everywhere, I’m going to do everything. There’s enough time where this will be supported so you can plan ahead a little bit. And as always, if we see people abusing it, we do reserve the right to change how we treat the tag, or to not respect the tag. There is a nice way that we try to prevent abuse. We allow things within the same domain, but we don’t allow things to cross domains. So with 301s, there’s always been this notion of can I hijack a site by doing weird 301s, and can I steal the reputation of some other site? And at least right now, this element is not really subject to that because you can only use it within the same domain. Now a natural question right after that, is well, what about subdomains? Can I, you know, do things across different hostnames?
And the answer is yes, you can. So, I was talking to Tony Hsieh from Zappos, and they were talking about duplicate content. And they have a server called zeta.zappos.com, which is sort of their staging software and might be the next version. And they were saying, well, can I send my canonicalness, can I splat it from zeta.zappos.com to www.zappos.com? And the answer is yes, you absolutely can. Can you use it from https and send that to http? Totally, works great for that. It’s on the same domain, so it’s no problem at all, at least within Google to use it for that purpose. And then what’s the difference between this and a 301 or a permanent redirect? There’s really not that much, other than this is restricted to one domain. So 301s can cross domains; this is all within the same domain.
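If you want to sanity-check your own pages against the same-domain rule described here, a deliberately naive check might look like the sketch below. It is not how any search engine decides this: it simply compares the last two labels of each hostname, which gives the wrong answer for suffixes like `.co.uk` or `.gov.uk` (a real check would need a public suffix list). The function names are made up for the example.

```python
from urllib.parse import urlsplit

def naive_site(url):
    """Last two host labels; wrong for multi-part suffixes such as .co.uk."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.lower().split(".")[-2:])

def same_domain(page_url, canonical_url):
    """True when both URLs share the same (naively computed) domain."""
    return naive_site(page_url) == naive_site(canonical_url)
```

Under this check, `zeta.zappos.com` pointing at `www.zappos.com` passes, as does https pointing at http on the same host, while a pointer to a different domain does not.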
In fact, whenever I think about it, the mental model that I have is that this is essentially like a little mini 301 redirect that you can generate with this link element. So, you know, if you think about how Google handles 301s, that’s probably a pretty good guess of how we’ll handle this particular element. So, a few more questions, since you’ve got the time, you’re watching the video. Do the pages have to be identical? Bit for bit identical? No, they do not. Think again about this case where you have a catalog page and you can sort by increasing price or decreasing price, those are conceptually pretty close to the same page. So if you want to say map this to the same URL, and don’t worry about the sort by parameter, you’re more than welcome to do that. They should be similar. You know, if we see, this is the only thing I can think of where there could be abuse, is if you’ve got a cartoon page over here, and you’ve got something that’s completely irrelevant to cartoons over here and you try to combine them together. And you’re not really gaining any advantage because you had PageRank on this page and on that page. So it really doesn’t make sense to combine them, but we do recommend that you use them for similar pages. They don’t have to be identical, but they should be similar.
A few sort of niggly bits. How about relative URLs versus absolute URLs? The answer to that is you can use either one. We recommend absolute URLs. And there’s a very simple reason. When you have relative URLs, you can move a URL and everything stays the same relative to that URL. So essentially, you know the homepage can say /images or images. And that will move it relative to that particular page. But it’s better to have an absolute URL because this is a powerful tool, and you really want to say this URL goes to exactly this URL. So you want to specify that. Whereas if it’s relative, if you mess it up here, then you might mess it up somewhere else as well. Can you follow a chain of canonical tags, or canonical elements, just like you can follow a chain of 301 redirects? Yes, but again I don’t recommend that, because if you have a big site and you have a big chain of 301 redirects, it’s easy for something to break. So, it’s similar, something can break and you don’t intend to have the consequences that you wanted to, so what I would recommend is absolute URLs, and going from the old URL to the new URL, one hop and that’s all you do. It’s just simpler that way, and you know you want to play it safe. You don’t want to accidentally shoot yourself in the foot. So what are some ways you can shoot yourself in the foot? Well, what if you say my canonical is over here, and that’s a 404 page? Right, the page might not exist. What if you had an infinite loop? This is canonical. No, this is canonical. And we’ve all seen those happen, you know, what is the Civil War? Look up the War Between the States. What is the War Between the States? Look up the Civil War. You know, and now you have to put the dictionary down and your head hurts. So try to avoid infinite loops.
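A site owner could audit for chains and loops before publishing. The sketch below walks a mapping from each page URL to the canonical URL it declares and reports where the chain ends, or `None` when it loops or runs too long. The mapping, the `max_hops` limit, and the function name are assumptions for illustration and say nothing about how a search engine actually resolves these.

```python
def resolve_canonical(url, canonical_of, max_hops=5):
    """Follow canonical pointers; return the final URL, or None on a loop or long chain."""
    seen = {url}
    for _ in range(max_hops):
        target = canonical_of.get(url)
        if target is None or target == url:
            return url  # no pointer, or the page points to itself
        if target in seen:
            return None  # infinite loop: A says B, B says A
        seen.add(target)
        url = target
    return None  # chain longer than max_hops
```

Any page whose result differs from its declared canonical is part of a multi-hop chain and is a candidate for pointing directly at the final URL in one hop.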
What if I point to a URL that hasn’t been crawled? You know, we’ll try to crawl that URL, but that corner case, what if I told in the webmaster console, oh yeah, everything should be www.example.com, but then you specify your canonicals as non-www, or without the www. So you can do all these sorts of things to almost shoot yourself in the foot, and the answer is we will try to handle all of these corner cases in a reasonable way. The slide has some Ghostbusters because there’s the old saying, “Don’t cross the streams,” right? So think about this, take some time, don’t just throw canonical tags on willy-nilly on your site, you know, try to plan it out a little bit so that you don’t run into these corner cases. So we’re getting towards the end of the presentation. I just really wanted to send a shout out to Joachim, who is the Google engineer who really did all the implementation, all the heavy lifting on this. Made sure that it worked very nicely within a 301, and thought about all the corner cases. So, for example, someone said, well what if I have a canonical, and I point to myself? Does that work? Yep, that works fine. What if I have a canonical and my href is empty? Well, it turns out that parses as an error, which turns out to point to itself. So all this stuff still works because Joachim did a really good design, but again, try to make sure that it’s all absolute URLs and everything’s specified well. Also, I’d love to send a shout out to Greg Grothaus. It turns out when you dig into this, a lot of people have proposed similar ideas. I saw at least one post out on the general web after we’d started exploring this that said, hey, why don’t you do this kind of a proposal? But Greg was really one of the people who sparked the discussion at Google, who really pushed for it and had a great idea, and so I sort of think of him as at least within Google, he really got the ball rolling and really sparked the wave of work on this, so I really appreciate that. 
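The relative-versus-absolute and empty-href cases can be explored with standard URL resolution. The sketch below uses Python's `urljoin`, which follows the generic URL resolution rules; it shows what a standards-following parser does with these values and is not a claim about Google's implementation. The function name `effective_canonical` is invented for the example.

```python
from urllib.parse import urljoin

def effective_canonical(page_url, href):
    """Resolve the href of a canonical link element against the page it appears on."""
    return urljoin(page_url, href)
```

An empty href resolves to the page's own URL, and the same relative href resolves differently depending on which page carries it, which is one reason to write out absolute URLs.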
And of course all the people, you know, from Maile and Wysz and Adam and Riona who have worked on the messaging and reached out to different people. At Yahoo!, Priyank, and a ton of people at Microsoft, Nathan Buggia and a bunch of other people as well. My hope is that lots of search engines will support this. So, Yahoo! and Microsoft have announced that they will support it, let’s keep our fingers crossed for Ask, I’d love for them to join in as well. Wikia, so Artur at Wikia had emailed us and sort of asked about doing canonical tags anyway. And so it was really great that they could test it out while we were trying it out ourselves. And then a ton of webmasters who always give us this sort of feedback on what they’d like to see.
On this last slide, I just list a bunch of resources, so Google, Yahoo!, and Microsoft all did blog posts about it. There’s an official Help Center documentation page. And, what we saw was, as people would come and have duplicate content questions, Joost had come and sort of asked about an interesting corner case, we just said, hey, you know what? We’ve got this thing coming out that might help with this. And so it was a very nice way to just do a sort of very quiet beta test and see how well it worked. So, Joost happened to email just a few days before we were ready to announce support, and so we gave him a heads-up about the possibility of this, and he turned around plugins not just for WordPress, but also for Magento, which is an e-commerce shopping software, and Drupal, which is another open-source content management system, which I think the White House just rolled out using Drupal. So really appreciate the work that he’s done as well. And in general, you know, be careful, be cautious, plan out how you want to use this tag. But we don’t intend to make any money off of it, we think it’s just good for the web, it’ll lead to less duplicate content. It’s an open standard, so any search engine that crawls the web can use this information to help, you know, make the web more relevant and increase the relevancy of their search results. And now you know as much as the audience knows when they attended SMX West. Thanks very much for listening, and talk to you soon.