Clickable transcript of my Canonical Link Element talk

Recently I’ve been playing with linking to specific parts of a video and incorporating YouTube subtitles. Then I realized that you could do a neat trick. YouTube allows you to create closed captioning with a simple text file that looks like this:

00:00:07.000
Hi everybody. Welcome back to another video. We’re doing this thing where when we speak at a conference

00:00:12.180
and we talk about something substantial, not just questions and answers, we talk through our presentation later

and it only takes a little bit of Unix command-line magic to turn that into a file like this:

<a href="http://www.youtube.com/watch?v=Cm9onOGTgeM#t=00m07s">Hi everybody. Welcome back to another video. We’re doing this thing where when we speak at a conference</a>
<a href="http://www.youtube.com/watch?v=Cm9onOGTgeM#t=00m12s">and we talk about something substantial, not just questions and answers, we talk through our presentation later</a>
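The exact commands aren't shown here, so what follows is only a rough Python sketch of the same conversion, not the Unix one-liners themselves. It assumes the caption file alternates a timestamp line with one or more lines of text, and `parse_cues` and `captions_to_links` are made-up helper names for illustration.

```python
import html
import re

# Assumed caption layout: a timestamp line such as 00:00:07.000,
# followed by the text spoken from that moment on.
TIMESTAMP = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.\d{3}$")


def parse_cues(caption_text):
    """Return a list of (offset_in_seconds, [text lines]) pairs."""
    cues = []
    for raw_line in caption_text.splitlines():
        line = raw_line.strip()
        match = TIMESTAMP.match(line)
        if match:
            hours, minutes, seconds = (int(group) for group in match.groups())
            cues.append((hours * 3600 + minutes * 60 + seconds, []))
        elif line and cues:
            # Text belongs to the most recent timestamp.
            cues[-1][1].append(line)
    return cues


def captions_to_links(caption_text, video_id):
    """Turn each caption cue into an anchor that deep-links into the video."""
    links = []
    for offset, lines in parse_cues(caption_text):
        if not lines:
            continue
        minutes, seconds = divmod(offset, 60)
        url = (
            f"http://www.youtube.com/watch?v={video_id}"
            f"#t={minutes:02d}m{seconds:02d}s"
        )
        text = html.escape(" ".join(lines), quote=False)
        links.append(f'<a href="{url}">{text}</a>')
    return links
```

Fractions of a second are dropped, which is why a click lands only approximately at the right place.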

If you run that over your entire caption file, boom, you have a clickable transcript of your video. For the text below, click on any phrase you’re interested in and you’ll be whisked away to YouTube in approximately the right place to hear me say that phrase.

Hi everybody. Welcome back to another video. We’re doing this thing where when we speak at a conference and we talk about something substantial, not just questions and answers, we talk through our presentation later and put it up so people can follow along, watch the slides, and hopefully learn a little bit. So today I wanted to talk about the canonical link element. And that’s something that Google, Yahoo!, and Microsoft all announced that they will support in the future at SMX West. So, the date that we had this announcement was February 12, 2009, and the funny thing about it is that Charles Darwin was born exactly 200 years ago that day.

So I started out with a slide where I made a corny joke and I said, whether you think the web was intelligently designed by Tim Berners-Lee, or whether you think the web needs to evolve, either way this is an open standard which helps people improve the web. And so we sort of said, what is a big problem that faces people today, webmasters, SEOs, site owners on the web? And it’s pretty clear that duplicate content is one of the things that people care about the most. So what is duplicate content? Well, I’ve got a slide here where I show I think eight different URLs, you know every single one of these URLs could return completely different content. In practice, we as humans whenever we look at www.example.com or just regular example.com or /index or home.asp, we think of it as the same page. And in practice, it usually is the same page. So technically it doesn’t have to be, but almost always web servers will return the same content for like these eight different versions of the URL. So, that can cause a lot of problems in search engines if rather than having your backlinks all go to one page, instead it’s split between a www and a non-www version. And it’s a really big headache. How do people solve this?

How do people fix this? Well, it turns out, and I’ll dwell on this slide for just a few minutes, there are a lot of ways to fix it. So, some people have joked that this canonical link element is kind of like, you know, Spackle that fixes over the appearance of all the cracks in the wall. And the fact is there are a lot of ways that you can fix things first and foremost, from the beginning, upstream where you don’t need to fix it downstream later on. There was a really funny quote by Jill Whalen at the conference where she said, “Developers keep SEOs in business.” Right? And so whether you’re a developer or an SEO there are some best practices that can make things a little bit easier for your system so that you don’t have to worry about this issue of duplicate content at all. So, one is to try to make sure that your URLs are standardized, Microsoft sometimes calls them normalized, in essence there’s only one way to get to the content. If your content management system always generates consistent URLs, and they’re completely uniform, and you don’t have to worry about having eight different versions in the first place, that just saves you a lot of trouble. You don’t have to worry about the issue coming up at all. So one way to do that is to fix your content management system or your software so that you only generate these URLs in a very consistent way. Another thing to do is to think about your site. Suppose you have www.example.com and non-www, just plain old example.com. Well if you link to www sometimes and non-www sometimes, it’s natural that search engines might get a little bit confused. So linking consistently, saying okay, my homepage is going to be www.example.com/. Nothing else, that’s it. And then making sure that all of your internal linking is consistent, that alone can make a really big difference, so that you don’t end up with two, three, four copies of each page.
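To make the "only one way to get to the content" idea concrete, here is a minimal Python sketch of URL standardization. The preferred host, the list of default document names, and the function name `normalize` are all assumptions for illustration; a real site would pick its own rules.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: these file names all serve the same content as their directory.
DEFAULT_DOCUMENTS = {"index", "index.html", "index.htm", "home.asp"}


def normalize(url, preferred_host="www.example.com"):
    """Map the www/non-www and default-document variants onto one URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()

    # Treat the bare domain and the www host as the same site.
    if preferred_host.startswith("www."):
        bare_host = preferred_host[len("www."):]
    else:
        bare_host = preferred_host
    if host in (bare_host, "www." + bare_host):
        host = preferred_host

    # Collapse /index.html, /home.asp and friends onto the directory URL.
    path = parts.path or "/"
    head, _, tail = path.rpartition("/")
    if tail.lower() in DEFAULT_DOCUMENTS:
        path = head + "/"

    return urlunsplit((parts.scheme, host, path, parts.query, ""))
```

With these assumptions, `http://example.com`, `http://WWW.Example.com/index.html` and `http://example.com/home.asp` all come out as `http://www.example.com/`. Generating links through one function like this is one way to keep internal linking consistent.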

If you do have, you know, home.asp or index.html, you can rewrite such that all those other URLs are 301 redirects to a single URL. So, it’s great if you can fix it at the beginning, it’s great if you can link consistently so the issue never comes up, but if duplicate URLs do occur, then you can use a 301, a permanent redirect as we refer to it, to sort of standardize and glom together all of those URLs. And search engines will follow that 301 redirect, and typically group them all together. Google also does a couple of extra things that some search engines don’t do. So, in our Webmaster Tools, our webmaster console, which is totally free, doesn’t cost anything at all, you can specify, for example my site is mattcutts.com, you can specify if you prefer www.mattcutts.com or non-www, so just mattcutts.com. That’s a very easy setting, and that solves a lot of duplicate content issues right there. And a little-known fact, not everybody realizes this, is that whenever you submit your URLs in what we call a Sitemap, which is another standard that’s supported by many major search engines, and it’s a very simple file, it can be as simple as a list of URLs, we take that list of URLs that you submit, and we say to ourselves, oh, if we see a URL in that list, and then we see another version of it that’s not in the list, we will prefer URLs in the list that you gave us. So we sort of use it to break ties whenever you submit URLs from a Sitemap. So there’s at least a couple ways that you can give Google hints that try to help out with duplicate content.
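As an illustration of the 301 approach, the following is a small WSGI sketch in Python that permanently redirects duplicate URLs to a single one. The preferred host, the default document names, and the names `canonical_location` and `app` are assumptions, and the sketch redirects every non-preferred host, which is only sensible if the server answers for a single site.

```python
PREFERRED_HOST = "www.example.com"               # assumption
DEFAULT_DOCUMENTS = {"index.html", "home.asp"}   # assumption


def canonical_location(host, path, query=""):
    """Return the URL to redirect to, or None if the request is already canonical."""
    head, _, tail = path.rpartition("/")
    clean_path = head + "/" if tail.lower() in DEFAULT_DOCUMENTS else path
    if host.lower() == PREFERRED_HOST and clean_path == path:
        return None
    location = f"http://{PREFERRED_HOST}{clean_path}"
    return f"{location}?{query}" if query else location


def app(environ, start_response):
    """WSGI application: 301 for duplicate URLs, 200 for the canonical one."""
    host = environ.get("HTTP_HOST", PREFERRED_HOST)
    path = environ.get("PATH_INFO") or "/"
    query = environ.get("QUERY_STRING", "")
    location = canonical_location(host, path, query)
    if location is not None:
        start_response("301 Moved Permanently", [("Location", location)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"canonical page\n"]
```

The standard library's `wsgiref.simple_server.make_server` can serve `app` locally for experimenting; in production the same rule would more likely live in the web server's rewrite configuration.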

But, that said, there will probably always be duplicate content issues that you can’t fix. So, just to run through a few example ones. Sometimes, you can’t generate a permanent or 301 redirect. For example, at my old school account, cs.unc.edu, I don’t run the web server there. So I’d have to open a ticket or drop an email to the people that administer that system and say hey, can you add a 301 redirect from this page to that page. A lot of free hosts, you might not be able to generate a 301 redirect. And you can’t help how people link to you. So for example, you know, even if you link consistently to just the www version of your website, some other people might link to the non-www version. And you can’t really control that at all. Uppercase versus lowercase paths. Microsoft IIS will support showing pages whether you link to home.asp capitalized or lowercase, and sometimes even mixed case. And so if people link to different versions that are uppercase and lowercase mixed, that can cause some issues. Session IDs are another really big factor. So I have seen, at least in some search engines, a site with a one-page privacy policy. And that privacy policy was indexed three thousand times, each time with a different session ID, because the privacy policy was slightly different each time.

So, you know, session IDs in general if you can avoid them are great. But sometimes you as the search engine optimizer or the person who is responsible for the site can’t get rid of them entirely. Tracking codes, you know, if you’re buying ads. Analytics, you know the UTM parameter, landing pages where they have to be different landing pages for different ads, those are the sort of things that you sometimes can’t get rid of. And if you run an e-commerce site, suppose you have different products. You might have sort by descending price or sort by ascending price, and sometimes you need to have different facets, different views of your data, and conceptually it’s really the same thing, it’s just a different way to slice and dice it.
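Session IDs, tracking codes and sort orders all show up as query parameters, so one way to work out the clean version of such a URL is to drop the parameters that don't change what the page is about. This Python sketch assumes parameter names like `sessionid`, `sid` and `sort`; only the `utm_` prefix and a "ctx" parameter are mentioned in the talk, the rest are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters that never change which content is shown.
IGNORED_PARAMS = {"sessionid", "sid", "phpsessid", "sort", "ctx"}
IGNORED_PREFIXES = ("utm_",)


def strip_noise(url):
    """Remove session, tracking and sort parameters, keeping everything else."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in IGNORED_PARAMS
        and not name.lower().startswith(IGNORED_PREFIXES)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

The result is a candidate for the clean URL that the duplicate versions should point at, whether through a redirect or some other signal.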

Finally, there’s breadcrumbs. So breadcrumbs are how did I get to this page? Am I coming to this red tent example via tents, or am I coming to it via colors, or did I come to it because I was interested in accessories? How did I land on this page? Even Google’s own webmaster help documentation sometimes has a CTX parameter that says here’s how we got to this page. And that day, it was kind of funny, the Queen had just launched a new website: royal.gov.uk. And so I wish the Queen the best, I want her to live long, and I wish the British monarchy the best, however, someone at the Telegraph, telegraph.co.uk, had done an SEO audit of this site, and they had found duplicate content issues. So you can see right here, just slash, royal.gov.uk/Home.aspx, and then at the very bottom I almost made a ransom note style where I mixed uppercase and lowercase. And the royal website returned the same page for all three of those URLs. So that was just a very simple example to illustrate that anybody can have these sorts of issues.

So what’s the answer? Let’s, you know, I’ve buried the lead enough, how do people solve this particular problem? Well, assuming you can’t solve it any other way, and absolutely I encourage you to try to fix it upstream, to try to link consistently. This is not something that you should just say, oh, now all my problems are solved, I don’t have to worry about anything else. But, if you can’t solve your problems in other ways, there’s a very simple element, link element, where you can say my canonical, and that’s a long word that means you know, my preferred, or the primary, or the clean, the pretty version of the URL that I want to use, is not this ugly URL with a tracking code or a session ID, it’s this pretty URL right over here. And all you have to do is in the head element of this document say you know what, even though this has a weird session ID, the pretty version, the canonical version of this URL, is over here. And that’s literally all it is. It’s a very simple open standard. It’s one simple element that you add to the head of your document. Some interesting little tidbits. This is the director’s cut so you get a little bit of extra info. Is this a tag?

Well, it’s kind of, the technical name I believe is “element.” But we’re all friends here, nobody’s going to abuse you or you know make fun of you if you call it a canonical link tag versus a canonical link element. People often speak about meta tags, right? And so meta tags are things that go in the head of the document as well. And so, if a meta tag has a value that is a hyperlink, I think the most correct thing is not for it to be meta, but for it to be called “link.” And so that’s why you see link rel=”canonical” href= and the value. So now you know the official name, but nobody’s going to care if you just call it the canonical link tag. One thing that’s kind of interesting about this tag, let’s just talk about a few high-order bits.
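For readers who want to see the element itself, here is a small Python sketch that writes a `link rel="canonical"` element for a page's head and reads one back out of a page. The helper names `canonical_link` and `find_canonical` are made up for illustration, and the reader is a simplification: it takes the first matching link anywhere in the markup rather than checking that it sits inside the head.

```python
import html
from html.parser import HTMLParser


def canonical_link(url):
    """Build the element to place in the page's <head>."""
    return f'<link rel="canonical" href="{html.escape(url, quote=True)}">'


class _CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> encountered."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.href is None:
            self.href = attributes.get("href")


def find_canonical(page_html):
    """Return the declared canonical href, or None if the page has none."""
    finder = _CanonicalFinder()
    finder.feed(page_html)
    return finder.href
```

A page reached through an ugly URL with a session ID would carry the output of `canonical_link("http://www.example.com/tents")` in its head, pointing at the pretty version.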

We don’t promise we’re going to abide by this 100%. Right? You know, if we see a webmaster and they’ve accidentally shot themselves in the foot, you know maybe they’ve created an infinite loop, and it’s very easy to create an infinite loop, we reserve the right to do what we think is best. At least at Google, we are going to treat this as a very strong hint. So unless we see some weird corner case or something where you’re probably hurting your own site, we probably would expect to respect this tag. So I think that in most cases, it will work quite well. But we do have to reserve the final, sort of bottom-line ability to say no, we don’t think this is what’s best for the users.

Again, if you can fix it yourself upstream, that’s much better. So look at all the other alternatives, the other choices before you use this tag. Don’t just say, oh, I can just slap everything with a canonical link tag and boom, I’m done. If you’re a regular user, just like a mom-and-pop and you use WordPress or you use some shopping cart software, it’s probably best not to just roll up your sleeves and go digging into it and trying to fix it all yourself, at least not quite yet. Wait a little while, because I think plugins will come out, people are talking about hey, is WordPress able to add this to the core software, so maybe you don’t even need a plugin? So if you’re just a regular user and you wait a few months, things should be fine. You know it’s a brand-new element, so there’s time for you to sit down and cautiously deliberate and say okay, what kinds of duplicate content do I have, how can I fix it? Take a little bit of time. Don’t just jump right in and start, oh I’m going to point everywhere, I’m going to do everything. There’s enough time where this will be supported so you can plan ahead a little bit. And as always, if we see people abusing it, we do reserve the right to change how we treat the tag, or to not respect the tag. There is a nice way that we try to prevent abuse. We allow things within the same domain, but we don’t allow things to cross domains. So with 301s, there’s always been this notion of can I hijack a site by doing weird 301s, and can I steal the reputation of some other site? And at least right now, this element is not really subject to that because you can only use it within the same domain. Now a natural question right after that, is well, what about subdomains? Can I, you know, do things across different hostnames?

And the answer is yes, you can. So, I was talking to Tony Hsieh from Zappos, and they were talking about duplicate content. And they have a server called zeta.zappos.com, which is sort of their staging software and might be the next version. And they were saying, well, can I send my canonicalness, can I splat it from zeta.zappos.com to www.zappos.com? And the answer is yes, you absolutely can. Can you use it from https and send that to http? Totally, works great for that. It’s on the same domain, so it’s no problem at all, at least within Google to use it for that purpose. And then what’s the difference between this and a 301 or a permanent redirect? There’s really not that much, other than this is restricted to one domain. So 301s can cross domains; this is all within the same domain.
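The same-domain rule can be sketched as a quick pre-flight check in Python. This is only an approximation under a loud assumption: it treats the last two labels of the hostname as "the domain", which is wrong for suffixes such as .co.uk (a real check would consult the Public Suffix List), and it says nothing about how any search engine actually decides. The function names are hypothetical.

```python
from urllib.parse import urlsplit


def registered_domain(url):
    """Naive guess at the domain: the last two labels of the hostname."""
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def same_domain(page_url, canonical_url):
    """True if both URLs share a domain; subdomain and scheme may differ."""
    page_domain = registered_domain(page_url)
    return bool(page_domain) and page_domain == registered_domain(canonical_url)
```

Under that assumption, zeta.zappos.com pointing at www.zappos.com passes, https pointing at http on the same host passes, and a link pointing at a different domain does not.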

In fact, whenever I think about it, the mental model that I have is that this is essentially like a little mini 301 redirect that you can generate with this link element. So, you know, if you think about how Google handles 301s, that’s probably a pretty good guess of how we’ll handle this particular element. So, a few more questions, since you’ve got the time, you’re watching the video. Do the pages have to be identical? Bit for bit identical? No, they do not. Think again about this case where you have a catalog page and you can sort by increasing price or decreasing price, those are conceptually pretty close to the same page. So if you want to say map this to the same URL, and don’t worry about the sort by parameter, you’re more than welcome to do that. They should be similar. You know, if we see, this is the only thing I can think of where there could be abuse, is if you’ve got a cartoon page over here, and you’ve got something that’s completely irrelevant to cartoons over here and you try to combine them together. And you’re not really gaining any advantage because you had PageRank on this page and on that page. So it really doesn’t make sense to combine them, but we do recommend that you use them for similar pages. They don’t have to be identical, but they should be similar.

A few sort of niggly bits. How about relative URLs versus absolute URLs? The answer to that is you can use either one. We recommend absolute URLs. And there’s a very simple reason. When you have relative URLs, you can move a URL and everything stays the same relative to that URL. So essentially, you know the homepage can say /images or images. And that will move it relative to that particular page. But it’s better to have an absolute URL because this is a powerful tool, and you really want to say this URL goes to exactly this URL. So you want to specify that. Whereas if it’s relative, if you mess it up here, then you might mess it up somewhere else as well. Can you follow a chain of canonical tags, or canonical elements, just like you can follow a chain of 301 redirects? Yes, but again I don’t recommend that, because if you have a big site and you have a big chain of 301 redirects, it’s easy for something to break. So, it’s similar, something can break and you don’t intend to have the consequences that you wanted to, so what I would recommend is absolute URLs, and going from the old URL to the new URL, one hop and that’s all you do. It’s just simpler that way, and you know you want to play it safe. You don’t want to accidentally shoot yourself in the foot. So what are some ways you can shoot yourself in the foot? Well, what if you say my canonical is over here, and that’s a 404 page? Right, the page might not exist. What if you had an infinite loop? This is canonical. No, this is canonical. And we’ve all seen those happen, you know, what is the Civil War? Look up the War Between the States. What is the War Between the States? Look up the Civil War. You know, and now you have to put the dictionary down and your head hurts. So try to avoid infinite loops.
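A site owner can catch chains and loops before a crawler ever sees them. The Python sketch below walks a table of declared canonicals, resolving relative hrefs against the page they appear on. The `declared` mapping and the function name `resolve_canonical` are assumptions for illustration; this is an audit aid, not a description of how any search engine resolves the element.

```python
from urllib.parse import urljoin


def resolve_canonical(start_url, declared):
    """Follow canonical declarations from start_url to their final target.

    declared maps a page URL to the href of its canonical link element;
    the href may be relative. Raises ValueError on a loop.
    """
    seen = [start_url]
    current = start_url
    while current in declared:
        # Relative hrefs are resolved against the page that declares them.
        target = urljoin(current, declared[current])
        if target == current:
            break  # a page naming itself as canonical is fine
        if target in seen:
            raise ValueError("canonical loop: " + " -> ".join(seen + [target]))
        seen.append(target)
        current = target
    return current
```

If the returned URL is more than one hop from the start, that is a chain worth flattening so every duplicate points straight at the final URL. In this sketch an empty href resolves to the declaring page itself, because that is what `urljoin` does with an empty reference.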

What if I point to a URL that hasn’t been crawled? You know, we’ll try to crawl that URL, but that corner case, what if I told in the webmaster console, oh yeah, everything should be www.example.com, but then you specify your canonicals as non-www, or without the www. So you can do all these sorts of things to almost shoot yourself in the foot, and the answer is we will try to handle all of these corner cases in a reasonable way. The slide has some Ghostbusters because there’s the old saying, “Don’t cross the streams,” right? So think about this, take some time, don’t just throw canonical tags on willy-nilly on your site, you know, try to plan it out a little bit so that you don’t run into these corner cases. So we’re getting towards the end of the presentation. I just really wanted to send a shout out to Joachim, who is the Google engineer who really did all the implementation, all the heavy lifting on this. Made sure that it worked very nicely within a 301, and thought about all the corner cases. So, for example, someone said, well what if I have a canonical, and I point to myself? Does that work? Yep, that works fine. What if I have a canonical and my href is empty? Well, it turns out that parses as an error, which turns out to point to itself. So all this stuff still works because Joachim did a really good design, but again, try to make sure that it’s all absolute URLs and everything’s specified well. Also, I’d love to send a shout out to Greg Grothaus. It turns out when you dig into this, a lot of people have proposed similar ideas. I saw at least one post out on the general web after we’d started exploring this that said, hey, why don’t you do this kind of a proposal? But Greg was really one of the people who sparked the discussion at Google, who really pushed for it and had a great idea, and so I sort of think of him as at least within Google, he really got the ball rolling and really sparked the wave of work on this, so I really appreciate that. 
And of course all the people, you know, from Maile and Wysz and Adam and Riona who have worked on the messaging and reached out to different people. At Yahoo!, Priyank, and a ton of people at Microsoft, Nathan Buggia and a bunch of other people as well. My hope is that lots of search engines will support this. So, Yahoo! and Microsoft have announced that they will support it, let’s keep our fingers crossed for Ask, I’d love for them to join in as well. Wikia, so Artur at Wikia had emailed us and sort of asked about doing canonical tags anyway. And so it was really great that they could test it out while we were trying it out ourselves. And then a ton of webmasters who always give us this sort of feedback on what they’d like to see.

On this last slide, I just list a bunch of resources, so Google, Yahoo!, and Microsoft all did blog posts about it. There’s an official Help Center documentation page. And, what we saw was, as people would come and have duplicate content questions, Joost had come and sort of asked about an interesting corner case, we just said, hey, you know what? We’ve got this thing coming out that might help with this. And so it was a very nice way to just do a sort of very quiet beta test and see how well it worked. So, Joost happened to email just a few days before we were ready to announce support, and so we gave him a heads-up about the possibility of this, and he turned around plugins not just for WordPress, but also for Magento, which is an e-commerce shopping software, and Drupal, which is another open-source content management system, which I think the White House just rolled out using Drupal. So really appreciate the work that he’s done as well. And in general, you know, be careful, be cautious, plan out how you want to use this tag. But we don’t intend to make any money off of it, we think it’s just good for the web, it’ll lead to less duplicate content. It’s an open standard, so any search engine that crawls the web can use this information to help, you know, make the web more relevant and increase the relevancy of their search results. And now you know as much as the audience knows when they attended SMX West. Thanks very much for listening, and talk to you soon.

Pretty fun, right? With a little more effort, you might be able to get the links to update an in-page embedded video instead of using static hyperlinks. Specifically, there’s a “seekTo” function in the YouTube JavaScript Player API. But right now I’m too lazy to dig into it.

36 Responses to Clickable transcript of my Canonical Link Element talk

  1. Spammer. 😉

  2. Can’t find the ‘no follows’ anywhere?? 🙂

  3. You have too much spare time 🙂

  4. Wow. That must’ve taken forever. Awesome.

    With all this free time, did Google fire you or sumthin? JK!

  5. Wow this trick must be the best idea i heard lately

  6. Matt
    OT – kinda – but still talkin’ canonical …

    How long after submitting a Sitemap file, with many of the URLs containing canonical links, might one expect to wait to see incorporation by Google crawl and index?

    For example, I was thinking of monitoring the Duplicate Meta tag reporting as one possible measure of Google beginning to incorporate our Canonical links. Would we see impacts in a week? Month? 24 hours? 😀

    Are there other measures we could track to try to learn when we can see some Canonical link benefit?

    thanks

  7. At least for me those links do not work… still the whole video is buffering, then it jumps to the anchor place… it is not even close to a Google Video link, which jumps straight to the offset and then loads just a chunk of video.

  8. Pretty cool, but I’d format those links a bit so the paragraph is more readable. A paragraph of orange underlined text is pretty tough on these old eyes.

    It’s a video promoters dream. Not only do you generate hundreds of links but with relevant anchor text. I don’t have much use for it, but someone looking for traffic to their video could take advantage by pre-formatting a linked up transcript for others to distribute when embedding their video. It would sure make Google’s job easier of making videos searchable, far better than relying on the small little description and mostly garbage comments that are generated.

  9. I thought you were going to have lots of links

    Here you have lots of links to the same page, with a different “Fragment” or “named anchor”
    http://www.mattcutts.com/blog/seo-glossary-url-definitions/

    It was my understanding that for SEO purposes, Google would only count a single link to a particular destination.

    Thus the first link to the video uses the following anchor text “Hi everybody. Welcome back to another video. We’re doing this thing where when we speak at a conference”

  10. Charlie Anzman, I’m happy to vouch for the content, so no need for a nofollow. Barry, you saw my previous post, right? 🙂

    Robert, I don’t have a lot of free time. Honestly, all the work was in making the original subtitle file, which was a product of http://www.thewysz.com/ (a fellow Googler). After that, it was just about 5 Unix command lines. Making timecode/closed captioning files is pretty labor intensive though. What would be cool would be voice recognition that would create a “rough draft” timecode file. Even if the actual words weren’t right, getting the timing info about your voice would still be a huge step forward. I don’t know if anyone (at Google or not) is tackling that as a project though.

  11. John, agreed that it would surface more content in videos and make it more easily crawlable. Double-plus agree that the formatting is a little screwy. I looked for easy-for-me linebreaks instead of breaking at natural paragraphs. But I just wanted to demo the idea.

  12. Hey Matt,

    Think about the vision of senior users when writing such clickable rich post. I guess neither your Dad, your Mom or I could read such a post without taking a nap in between 🙂

  13. Interesting stuff – do you think that this could have an effect on Universal search results for a particular fine tuned search query?

  14. Very nice hack.

  15. Clever use of the caption file. I imagine this would be a great use for the disabled.

  16. I have nothing of value to add, other than re-stating the obvious: This is awesome!

  17. That looks cool, but you are surpassing the 100links/page rule 😛 😛 😛

  18. Hey Matt,

    Another great video! Thank you to everyone who put the time into making this possible. I really respect you and Google for trying to make the web a better place. I wish you had more time to share your knowledge, I always learn a lot and once again thanks.

  19. This is a pretty intense sitemap, I will be watching how this affects the rankings. 🙂

  20. Neat, very neat. I really appreciated the transcript, I read many times faster than the spoken word so I hardly ever click on videos.

    But I was reading the transcript and thinking: that must have been so much work.

  21. Dave (original)

    Definitely more than 100 links in this post alone.

  22. I don’t believe my eyes 🙂

    I also checked your source code ! The idea is great 🙂

    But unfortunately Youtube is blocked in Tunisia.

    And it’s very bad to hear that Google Video will soon disable files submissions 🙁

    We have to look for alternative video publishers. So please Matt do we have the same video ranking chances when using wat tv, kewego or vimeo…?

    Thank you

  23. Hey I read this at work, and immediately had an idea:

    Why not get even more granular, and try to estimate the correct position between notations?!

    So, I frankensteined together this example here:
    http://scriptedlife.com/data/highlight.html

    It uses a JavaScript function to track what text is selected, find where it should be in the video, and send the embedded clip to that point. It’s buggy, but kinda fun just to go around the page highlighting text and seeing how close it gets.

    Thanks for the inspiration!

  24. Pretty cool, it’s a nice way to promote your youtube videos and profile.

  25. @Matt: Nice idea

    @Andy: This is a very nice piece of Frankensteined code !

    So what’s next ? A service that’ll:
    – automatically search for (and find) transcripts for talks on Google Video and Youtube,
    – combines the two and dynamically whips up a page like the one Andy has created above

    Who’s first ?

  26. Matt: is there any way you could hyperlink just the first word of a passage or something? That’s mega-hard to read.

  27. This is very cool. It will change the game for video SEO when Google rolls out automatic closed caption capture for all video and audio files and then uses that audio text as part of the indexing algorithm just like you do for web pages now.

  28. This post looks very spammy……and irritating too…

  29. Wow, I did not even think that this will pass as acceptable for google. It is not only awfully painful to the eyes, but also packs a ton of outbound links on a single page. I know that a human will immediately know that this is some content rich page, but with the Google algorithm not really involving humans, I don’t see how their computers can identify this as a non-spam.

    You’re just full of surprises man.

  30. I definitely wouldn’t recommend doing that on a brand-new site.

  31. Kudos for your efforts, Matt, but somewhat puzzled…what was the purpose? Missed that somehow…

  32. Thanks for the info. That the sitemap sorts out some of the duplicate content issues helps me to understand all of this.

  33. Matt,

    This labeling is a great invention i believe.

    I have a question,

    Is Google looking to generate video search results from these captions, from time intervals and captions specified on that time ? If these captions are crawl able.

    My question is for videos which are optimized and not optimized both ?

    If it’s the scene then i think its great for content of those users who are ignorant of the techniques and usage of optimization standards.

  34. I am amazed with the video links. I would spend a couple of hours all day to upload so many videos like that and I will need more times to create the hyperlinks on the post. It’s really a great effort 🙂 and I believe that people do not like too many videos with a rapid link on a single page, oh no, on a single post, exactly 😀

  35. Interesting although I agree it’s a lot of links. I think being able to just link the beginning of the most major points of a video would be useful. Sometimes I really don’t want to watch an entire video just to see the last portion.

  36. Honestly. Tell us how long it took to link all that together?
