Generic Toolbar Indexing Debunk Post

Sometimes people think that the Google Toolbar led to Google indexing a page. Here’s a recent such story, for example, which speculates how urls with the substring “mms2legacy” got indexed. Here’s where I started to disagree:

The reason for this [supposedly unlisted urls getting crawled –Matt], explained Ken Simpson, CEO of anti-spam company MailChannels, is that one’s Google Toolbar may be configured to pass URLs that one visits to Google for indexing. “If you run Google Toolbar, it knows pages you visit,” he said.

Sorry, but if Ken Simpson is implying that the Google Toolbar led to these urls being crawled, then he’s mistaken. Let’s take the first result from the [inurl:mms2legacy] query given in the article. The first url in that result set that I saw was http://mediamessaging.o2.co.uk/mms2legacy/showMessage2.do?encMmsId=F1ABCF6D326A3F65 . Well, if you take the string F1ABCF6D326A3F65 from that url and search for that then you’ll find multiple references to that url. In the cases I looked into, we found these pages via someone publishing a link on http://my.opera.com or other places around the web. I can definitively say that all the urls I looked into were discovered via crawling regular old links.

Folks with great memories may remember that I’ve talked about this before. Back in 2006, both Philipp Lenssen and Google OS did controlled experiments by visiting unlinked deep pages with the toolbar, and both concluded that the toolbar did not lead to those urls being indexed.

It’s good to reiterate this every couple years though, especially as Google has gotten better at finding new pages as it crawls. We get questions like this often enough that we have an FAQ answer about it:

Why is Googlebot downloading information from our “secret” web server?

It’s almost impossible to keep a web server secret by not publishing any links to it. As soon as someone follows a link from your “secret” server to another web server, your “secret” URL may appear in the referrer tag and can be stored and published by the other web server in its referrer log. So, if there’s a link to your “secret” web server or page on the web anywhere, it’s likely that Googlebot and other web crawlers will find it.
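As a rough illustration of the FAQ's point, here is a minimal Python sketch of how a "secret" URL can surface in another site's referrer log. The log lines, hostnames, and paths are made up for the example, and the Apache "combined" log format is an assumption; real servers may log differently.

```python
import re

# Hypothetical access-log lines in Apache "combined" format, as some other
# site's server might record them. Hosts and paths are invented.
LOG_LINES = [
    '203.0.113.7 - - [21/Jul/2008:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5120 '
    '"http://secret.example.com/projects/draft.html" "Mozilla/5.0"',
    '203.0.113.9 - - [21/Jul/2008:10:16:02 +0000] "GET /about.html HTTP/1.1" 200 2048 '
    '"-" "Mozilla/5.0"',
]

# In the combined format, the referrer is the first quoted field after the
# status code and the response size.
REFERRER_RE = re.compile(r'" \d{3} \S+ "([^"]*)"')


def extract_referrers(lines):
    """Return the non-empty referrer URLs found in the given log lines."""
    referrers = []
    for line in lines:
        match = REFERRER_RE.search(line)
        if match and match.group(1) != "-":
            referrers.append(match.group(1))
    return referrers


referrers = extract_referrers(LOG_LINES)
for url in referrers:
    print(url)
```

If the other site publishes a stats page built from that log, the extracted URL becomes an ordinary link that any crawler can follow.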

Security through obscurity is not a great way to keep a url from being crawled. If you don’t want your content in Google’s web index then we provide a ton of advice on how to prevent that content from getting into Google.
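One widely used mechanism for asking crawlers to stay away is a robots.txt file. The sketch below uses Python's standard `urllib.robotparser` to check a made-up robots.txt; the rules and the example.com URLs are assumptions for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that asks every crawler to skip /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch() reports whether the named user agent may request the URL
# under the parsed rules.
private_ok = parser.can_fetch("Googlebot", "http://www.example.com/private/report.html")
public_ok = parser.can_fetch("Googlebot", "http://www.example.com/index.html")

print("private page crawlable:", private_ok)
print("public page crawlable:", public_ok)
```

Keep in mind that robots.txt is a request to well-behaved crawlers, not access control. Content that must stay private is better placed behind authentication.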

37 Responses to Generic Toolbar Indexing Debunk Post

  1. Hi Matt,

    I know you are the guy at Google and I am not but if the Toolbar does not influence crawling today it must have in the past. I (we) made the experience in the past that if we did surf to pages which did not have any incoming links for 100% they had been visited by the Google Crawler a few minutes after visiting the URL with the Toolbar “in Action”.

    We have used the Google Toolbar since the beginning, and I cannot otherwise explain how those pages got crawled a few minutes after a visit and later showed up in the index. Just to be fair, I am not 100% up to date on this.

    Thanks, Michael

  2. Matt,

    Those recent “debunking posts” are very interesting and educational at the same time. Keep up the good debunking 🙂

    And of course, no harm in sending me some Sphinn Love at your convenience 🙂

  3. What about links in Gmail? Do you use Gmail as a source for new links?

    I highly doubt it, but it is sometimes a mystery to me too how Googlebot finds pages. A folder such as ~projets, for example, where you put sites you are working on to show customers your progress, is often found even without any links to it. If you could use an example like that, it would be appreciated.

  4. I agree. I am writing an article about Glasnost/Perestroika and the NEW Google policy of openness. I am also going to buy Google stock. Did you guys hire a new CEO or what? Good work Matt.

  5. Peter (IMC), to the best of my knowledge we don’t use any urls from Gmail either. That’s the sort of thing that you could easily run a test on as well (email yourself a reference to a deep url that is unlinked on the web). I’ve already privately debunked one incident where a person thought that emailing a link got a page crawled.

    I (we) made the experience in the past that if we did surf to pages which did not have any incoming links for 100% they had been visited by the Google Crawler a few minutes after visiting the URL with the Toolbar “in Action”.

    Michael, let me try to convince you another way. We crawl millions (billions?) of pages a day. People probably surf billions of pages a day with the Google Toolbar on. It’s inevitable that sometimes by coincidence we’ll crawl a page soon after someone with a Toolbar installed surfed it. But correlation does not imply causation. I’d refer you back to the links where Philipp Lenssen and Google OS both did controlled experiments on deep unlinked urls and saw that the toolbar didn’t lead to those pages being crawled.

    By the way, this law of large numbers is also why I end up debunking “I started/stopped/increased/decreased my AdWords spend and saw my organic rankings increase/decrease.” Google’s rankings often change as we find new pages/links and the web itself changes. And lots of people are always tweaking their AdWords spending up/down/on/off. So it’s inevitable that you’ll have a few people increase/decrease their AdWords spending and see their rankings go down/up. That’s probably the most common misperception I end up debunking.

    Here’s another view. Suppose the odds of it raining on any given day are 50/50. If you washed your car eight times and all eight times it rained, you might suspect that you washing your car caused it to rain. But the odds say that 1 in 256 people will have that happen to them (correct me if I’m off by a power of two). The numbers of people advertising or surfing with the Google Toolbar and the number of rankings changing or pages being crawled are much larger, so it’s inevitable with large numbers that you’ll see a few cases that look suspicious but are just pure chance being played out. Mathematically, it would be more suspicious if you *didn’t* see a few things that look like strange coincidences.

    Thanks, Harith and panzermike. 🙂
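The car-wash arithmetic in the comment above can be checked with a short Python sketch. The 50/50 rain odds and the eight washes come from the comment; the population of one million people is an assumption chosen only to show the scale effect.

```python
from fractions import Fraction

# Ground rules from the comment: rain on any given day is a 50/50 chance,
# and each day is treated as independent of the others.
p_rain = Fraction(1, 2)
washes = 8

# Probability that it rains on all eight car-wash days.
p_all_rain = p_rain ** washes
print(p_all_rain)  # 1/256


def expected_coincidences(people, probability):
    """Expected number of people who see the coincidence purely by chance."""
    return people * probability


# Assumed population size, for illustration only.
expected = expected_coincidences(1_000_000, p_all_rain)
print(float(expected))
```

So the 1-in-256 figure holds under those ground rules, and with a large enough population thousands of people would see the "suspicious" streak with no causation involved.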

  6. I’ve heard that one often, Matt. A lot of otherwise intelligent SEOs have this idea in their heads that Google factors in the number of times a url is sent in emails via Gmail. Firstly, that would obviously be wrong, because spam urls would be amongst the top results. It’s not often that you see herbalife sponsors and penile enlargers topping the results for Enlarge Image and other non-related queries.

    One thing I’d like you to address is whether any data obtained from the Google Toolbar is used in ranking sites. For example, does Google use traffic data obtained by the Toolbar to rank sites, so that more visits to a site, as measured by the Toolbar, influence its results? Also, with Google Analytics: if I allow Google to use the Analytics data, will this influence my results? I’d like to hear an official answer rather than mining the privacy policy and terms of use. So far I’ve been avoiding merging my data out of fear that one may influence the other; for example, Google may in the future want to rank other sites above me because they get more traffic and are therefore considered more important.

  7. Hi Matt,

    I read through your post and the cynic in me only sees you debunking this particular incident, so perhaps you can humor me. Can you please specifically state that there is no way the Google Toolbar can lead to a web page being indexed?

    Personally, I really don’t care, but it seems your post leaves this statement out.

    Cheers!

  8. Matt,

    That’s all well and good, but your post titled “What should NOINDEX DO?” has shown that Google is willing to ignore its own guidelines and include pages in the index even when webmasters specifically request that you not do so.

    That type of inconsistent “google knows best” behavior is what fans the fires of cynicism.

  9. Hey Andy, I tend to stay away from categorical assertions (“Google will never do X”) or hard promises about the future just to be ultra-safe. But anyone can email themselves links or surf with the toolbar to unlinked urls to verify this themselves.

    Travis Lane, correct me if I’m wrong, but the “What should NOINDEX DO?” post came out exactly how most webmasters wanted it, right? We polled for extra feedback. Most webmasters wanted NOINDEX to completely block a page from showing in Google. And that’s what we do. I would say rather than “Google knows best,” that post showed that we genuinely wanted to get a sense of what the webmaster community wanted, and we ended up continuing our NOINDEX policy that webmasters preferred over MSFT’s and Yahoo’s policy.

  10. Ignore this, just testing the escaping in formatting.php–to see whether it will — escape things.


    Cool.

  11. This is partly a devil’s advocate question, but partly a question I’m going to ask because I know others will want to know the answer.

    Let’s take a new site (e.g. one under construction). We’ll call it NSUC.com.

    NSUC.com has no inbound links whatsoever…in fact, no one knows that the site exists.
    The site was never submitted to Google.
    Google Toolbar doesn’t index pages that it finds.

    So how are those pages found, and assuming there is no bot-blocking, indexed?

  12. Multi-Worded Adam, someone could have submitted it via the “add url” form. Someone could have surfed from that domain and left a referrer on the web as a result. I could brainstorm some other (non-Toolbar, non-Gmail, non-Google-specific) ways to discover domains with you sometime if I ever see you at a conference.

  13. MWA

    Long time no comments. How come 🙂

  14. The easiest way of not being in the Google index is to get a spammer to fill up a file with 4152 spam short links and bury it deep in your web site at, say, http://www.yourwebsite.com/wp-includes/js/tinymce/themes/advanced/images/xp/ and then your website will just disappear from Google’s index. Sigh 🙁

  15. Interesting that it’s O2 messing up. In BT they weren’t top of the internal pecking order in terms of technical chops.

  16. MWA

    Long time no comments. How come

    Been busy. Still am. That’s about all that I can say.

    I could brainstorm some other (non-Toolbar, non-Gmail, non-Google-specific) ways to discover domains with you sometime if I ever see you at a conference.

    Find me one in Toronto that I don’t have to pay for and I’m there. 😉

  17. Matt, the fact that you have to make this post suggests that Google still isn’t on top of filtering out webstats links. It’s astonishing how many newbie webmasters publish their stats pages, unaware that referral spammers are exploiting this. My guess is, this is by far the main way that “super secret” urls get indexed.

    Perhaps you need to take an audit of all of the popular stats packages that might be indexed, identify their footprints, and discount all of the links coming from these pages.

  18. “Here’s another view. Suppose the odds of it raining on any given day are 50/50. If you washed your car eight times and all eight times it rained, you might suspect that you washing your car caused it to rain. But the odds say that 1 in 256 people will have that happen to them (correct me if I’m off by a power of two).”

    Hi Matt,

    It already does not sound right. I’d also be careful with calculating odds and making assumptions based on them. Stats and figures can easily fool you. I suggest watching Peter Donnelly’s presentation at Ted.com on this subject. Very enlightening.

    http://www.ted.com/index.php/talks/peter_donnelly_shows_how_stats_fool_juries.html

    p.s. Did you move to Seattle? 🙂

    I’d say that eight times without rain when you wash your car in Fresno, CA during July and August is itself the reason for it not raining. Using the water for the car wash causes a drought. Drought = no water in the air = no rain. hehe. So somebody must be washing his car every single day, because it never rains in July or August here in Fresno. hehe.

  19. Maurice, I think that O2 is just making the photo urls unlisted. But then if you link to those photo urls (as some people do), we’ll often find and crawl those links.

    Carsten Cumbrowski, I already set the ground rules that the odds of rain were 50/50. Technically this type of logical fallacy is known as the “Post hoc, ergo propter hoc” fallacy (after this, therefore because of this). See, I did pay attention in those English classes in high school. 🙂

  20. Hey Matt,

    Any idea about Google Analytics?

    If I have Analytics installed on a site that is not “live” yet – e.g. no known inbound links – and I visit the site directly – would Analytics add the URLs to the crawl list?

  21. Hi Matt,

    Does the same apply to GTalk as to Gmail? Would I be safe to assume this?

    Thanks!
    Nick

  22. Hey Matt, while you’re debunking (sorta) Google’s internal capture and use of data, could you definitively state that Google Analytics is not transferring data to the page rank or AdWords score engines?

    I’ve defended Google on this. Please don’t make me look stupid.

    Thank You

    @AndyBeal I don’t think he answered you. ;~)

  23. where did the smiley faces go in google toolbar?

    lol..

  24. Great post and love how you share the “debunk the myth” type information. I agree, you can’t make categorical declarations that Google will never do X, and your information is clearly not that by any means.

    Thanks!
    Maria Reyes-McDavis

  25. Matt,

    Does the Googlebot crawl Alexa and Compete? Then, in theory, any toolbar that checks Alexa or Compete rank can pass an unlinked page to those systems, and thus get a page indexed by Google — albeit indirectly.

    Analyzing log files, we’ve seen unlinked pages get a single visit from a browser and then, within 24 hours, a visit from ia_archiver (Alexa).

    I think this falls into the “interesting but unlikely” category. Unlinked pages rarely get enough traffic to be listed on Compete and Alexa, however, the route exists and may warrant a no-index for pages that folks want to keep out of the index.

    Thanks,

    -Michael

  26. For the record.

    Just wish to add another debunking which Matt has posted on 21st July 2008 in reply to a post on Techdirt; Why Is Google Punishing Sites That Publish Full RSS Feeds? [UPDATED]

  27. Nick Stamoulis, Ed Shaz/, and Charles, I wouldn’t claim to be the expert on every Google property, but (without chatting with either team) I very much doubt that Google discovers pages from either Google Talk or Google Analytics. In addition, I’ve personally promised that my webspam team wouldn’t go and ask the analytics team for data, and webspam has held to that promise. I know that the Analytics team has a strong privacy policy, and if people want to share (I believe anonymous) data, then there’s an opt-in in Google Analytics. But even then, I don’t believe that data is used in Google’s crawl.

    Michael Stebbins, http://www.alexa.com/robots.txt forbids Googlebot from crawling several directories, so we don’t crawl those directories. I took a quick look at Compete and just from surfing around didn’t see any places that would lead to deeper crawling by Googlebot, although I didn’t check everywhere.

  28. Dave (original)

    Come on Matt, must you keep spoiling good stories with facts and common sense? 😉

    Most SEOs have been claiming for years that Google finds pages via their Toolbar. 99% of them will continue to spread the myth (along with many others, e.g. “Sandbox”, “aging delay” etc.) over facts and common sense. After all, it’s what they do best and what the industry is built upon.

  29. Dave (original)

    WOW.. you are back (from vacation, maybe) 🙂

  30. Dave (original)

    Yep, and it was the best of my life. Bora Bora, Tahiti.

  31. Hi Matt, just going back to Peter (IMC)’s question on Gmail: I use the alerts system from Google on all of my anchor text and keywords. In fact, I was one of the first to use it for SEO, so I get a heck of a lot of URLs sent to me daily, all through Gmail (LOL), and it has not done my SERP any harm! Good info in this thread. Many thanks, David

  32. Dave (original) Said,
    July 21, 2008 @ 11:49 pm

    …. facts and common sense….

    I just love it when you use those words... 🙂

  33. Matt, but there might be one way an AdWords buy could influence ranking indirectly.

    Some assumptions:

    1. Google sometimes ranks pages better if G’ has more knowledge about this page and this information is fresh; this is documented in Google’s patents but this doesn’t mean Google has chosen a setup of the running system that follows this approach.

    2. Google crawls pages immediately if the URL is the target of an Adsense campaign; this leads to very fresh information about the page.

    3. Google shares information about pages between its different services (you described the caching system in another post).

    Conclusion:

    A campaign targeting a page on your domain might indirectly increase the ranking of this page. This is a side effect of the design of the complex Google system.

    I did some testing with small, medium and large sites and compared notes with webmasters of other sites. I think that the effect is very low for small and medium sites (you are very good at keeping them fresh in your index), but it could make a difference for very large sites (with complex structures, where you might not be as good at keeping them completely fresh).

    Btw. the opposite proved to be true too: if the content of the campaign’s landing page was low quality, then running a campaign against it reduced its ranking. Conclusion: the fresher and deeper information led to a worse ranking.

    Finally: advertising against a high quality page leads sometimes to a better ranking. This is a result of the complex Google system.

  34. Matt I have a related question. There has been talk of Google possibly using the bounceback rate or how long users stay on pages as part of determining how relevant a page is to a particular search phrase.

    I am not sure Google currently does that or really plans to, but if they do or plan to, how will they gather that data?

    Personally I think it is a great idea. Tracking people who type in a search phrase, then adding in how long those people spend on the web page would be using a sort of social aspect to ranking pages.

    So to be clear: Is Google doing this or planning to? How will that data be tracked? And would the Google Toolbar possibly be used to track how long people spend on a particular web page?

  35. Thanks for the info. How would you respond to another BIG rumor on the web that Google Ad Planner gets its info from the Toolbar? How come Google is so silent about the data source?

  36. don't be evil?

    The mysterious indexing mentioned above seems (to me) to come from the Google AdSense program. Its terms and conditions say that Google indexes and caches all or part of the URLs of websites participating in the AdSense program. In a sense, this is more problematic than indexing via site users’ Toolbars, in that users have hardly any way to avoid the AdSense program (users won’t know unless the Google ad pops up, and when it does, it’s too late). In addition, Google Ad Planner also takes data from the AdSense program.

    I personally think it’s a pity that Google is so silent about its data collection.

    Matt, correct me if I am wrong. I hope I am wrong!

  37. But it’s not always “sites visited” that show up. From what I’m being told by a webmaster whose password-protected pages are showing up on Google, it’s people’s bookmarks that are passed on to Google by the toolbar, not their browsing history.

    I have no idea whether this is correct, but at least the argument puts a slightly different spin on it.
