Solved: another common site review problem

Okay, go read this post on the Google webmaster blog. In fact, if you read my site, you really should add the Official Google webmaster blog feed to your list of subscriptions, because that blog is almost 100% SEO/webmaster-related posts, and it is official. Done reading? Okay, I’ll give you my personal take on why I like this idea.

I’ve done a lot of site reviews in my time. Many of them go like this:

Webmaster: Matt, can I get a site review for ExampleCo??
Me: Hey, I’ve heard of Example. I really like your red widgets.
Webmaster: Thanks! We’re rolling out a new line of blue widgets this fall. The site is example.com.
Me: Okay, let’s take a quick look.

(Small talk about blue widgets until the site loads.)

Me: Huh.
Webmaster: What? What does “Huh” mean?
Me: Well, when I visit www.example.com I get a map of the world, and then at the bottom of the page there’s a dropdown to select which country version of Example to go to next.
Webmaster: Right. Example is a big business with lots of different country-level domains, so we have to ask the user where they want to go. Why, is that a problem?
Me: It sort of is. Dropdown boxes and forms are kind of like a dead end for search engine spiders. Historically we haven’t crawled through them.
Webmaster: But it’s just a dropdown box with ten countries listed. You can’t just crawl that?
Me: Not really. Think of search engine spiders much like small children. They go around the web clicking on links. Unless there’s a link to a page, it can be hard for a search engine to find out about that page.
Webmaster: But it’s just ten countries. Couldn’t the search engine just pick one of those values and keep going?
Me: In theory you could do that, but in practice the major search engines don’t usually do that.
Webmaster: That sucks. I like how clean the page looks. Is there a way around that?
Me: Sure. You could put the list of countries at the bottom of the page and make them hyperlinks so that Googlebot can crawl through to the other urls. A good rule of thumb is to take a look at your site in a text browser like Links or an ancient browser with JavaScript/CSS/Flash turned off. If you can reach all your pages just by clicking regular links, your site should be pretty crawlable.
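
To make that concrete, here’s a minimal sketch of the two patterns (the country list and urls are placeholders, not Example’s real ones). The first is the dropdown dead end; the second is the plain-link version that a spider can walk right through:

    <!-- Dead end for spiders: the destination urls exist only inside a form -->
    <form action="/go" method="get">
      <select name="country">
        <option value="us">United States</option>
        <option value="uk">United Kingdom</option>
        <option value="de">Germany</option>
      </select>
      <input type="submit" value="Go">
    </form>

    <!-- Crawlable: every country version is reachable via a regular link -->
    <p>
      <a href="http://www.example.com/">United States</a> |
      <a href="http://www.example.co.uk/">United Kingdom</a> |
      <a href="http://www.example.de/">Germany</a>
    </p>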

I’ve had this conversation a lot over the years. Savvy webmasters and SEOs know how to make a site crawlable, e.g. making sure that someone can reach every page on a site via normal HTML links. But the web is filled with sites that have a dropdown box or some other form that search engines typically didn’t know how to handle.

Now Google is finding ways to crawl through forms and drop-down boxes. We only do this for a small number of high-quality sites right now, and we’re very cautious and careful to do the crawling politely and abide by robots.txt. If you’d prefer that Google not crawl urls like this, you can use robots.txt to block the urls that would be discovered by crawling through a form. But I hope that the dialog above is a pretty good example of why this new discovery method can be helpful to webmasters.
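
As a concrete illustration, suppose a form on your site submits via GET to a /results url (the path here is hypothetical). A robots.txt rule along these lines would keep Googlebot away from anything it might discover through that form:

    User-agent: Googlebot
    Disallow: /results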

Danny asks a good question: if Google doesn’t like search results in our search results, why would Google fill in forms like this? Again, the dialog above gives the best clue: it’s less about crawling search results and more about discovering new links. A form can provide a way to discover different parts of a site to crawl. The team that worked on this did a really good job of finding new urls while still being polite to webservers.

By the way, I wanted to send out props to a couple people outside Google who noticed this. Michael VanDeMar emailed me a little while ago to ask about this, and Gabriel “Gab” Goldenberg recently noticed this behavior as well. I appreciate them discussing this because it encouraged Google to talk about this a little more. 🙂

40 Responses to Solved: another common site review problem

  1. Is this to say that the sites on the SERPs that have several links under the main link are an example of this new technology?

    http://www.google.com/search?hl=en&q=belts

     In other words, you will have a listing at the top (usually the main homepage); then, underneath, will be two small rows of links to other pages in the same domain. Sometimes there will be a search box.

    Is this the result of this new technology?

    Also, what is meant by ‘high-quality sites’?

     Is this a subjective determination, or determined by algos?

     Thanx 🙂

  2. Miss Universe, no, what you are showing is examples of what are referred to as “sitelinks”, extra navigational links that Google shows on certain sites, as long as the query was relevant to all of the pages shown.

    What Matt is talking about is say for instance you have a search box on your site, such as WordPress comes with by default, so people can search your site using the WordPress engine, and not just rely on using Google to find pages. What Googlebot has been doing is taking words and phrases from the context of the page, and essentially filling in that search box, looking for new pages.
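
     As a rough sketch of why that works (the "s" parameter is WordPress’s default search parameter; the rest is illustrative): a GET search box is really just a url factory, so once Googlebot picks a phrase from the page, submitting the form produces an ordinary url that it can fetch like any other.

         <form method="get" action="/">
           <input type="text" name="s">
           <input type="submit" value="Search">
         </form>

         <!-- Typing "blue widgets" and submitting yields a plain url: -->
         <!-- http://www.example.com/?s=blue+widgets -->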

    Matt, thank you again for explaining this.

  3. Hey Matt,

    Thanks for posting this explanation and clarifying stuff for a whole bunch of folks. The link is also much appreciated :D.

    BTW: For you sphinners in the audience – drop your vote here if you care to:
    Matt Cutts Explains Site SERPs in Google SERPs

  4. So how does a spam blocker (i.e. “what is the sum of 1+1?”) affect it? Does the bot just enter random numbers?

    Because of the spam, I’ve had to put spam challenges on every form I’ve done.

    Depending on the mod_security rules on the server, if the bot keeps randomly trying numbers to validate the form, it might even get the bot IP banned from the server.

  5. There are many sites that would benefit from this. While I don’t like quick pick drop lists, they are common for many industries. Great to hear Google continues to evolve its indexing paths to get to the good content. Don’t forget to update those robots.txt files!

  6. spamhound, that would probably keep Googlebot from crawling that form as well. The amount of activity from Googlebot should be so light that I’d be very surprised if it showed up in any significant amount over the course of a day.

  7. Thanks for the update, and for pulling together the two concepts of form crawling and the mysterious search results crawling. I’d always dismissed the search form crawls as coming from errant links, since Googlebot didn’t have creativity and couldn’t fill out the form… when will she become self-aware? Should I sell my stock in Skynet and invest in GOOG?

  8. Matt, one of the first things I have read online was Google’s Webmaster Guidelines and they clearly state that we should build sites without thinking about Google and everything should rank where it should. If I decide to have a dropdown menu and you say “but in practice the major search engines don’t usually do that.”, doesn’t that force webmasters to design and structure their sites thinking about Google and optimization in almost every aspect of creating a site?

  9. @ Mike Dammann

    Google isn’t trying to access pages that require login information or anything like that with this new capability. That being said, if the pages on the other end of the form are meant for your visitors to see then why not have Google see them as well?

    If the pages aren’t meant to be seen by Google then password protect them, slap up a captcha or take a minute to update your robots.txt.

    Google isn’t responsible for our lack of safeguarding against Googlebot or other spiders crawling the web. Personally I think the only reason I’ve seen so many complaints about this new capability is because it is Google that is now doing it. If it was Dogpile or some other second hand search engine this wouldn’t be getting as much attention. Just a thought though…

    John Jones

    – 10 minutes of SEO, SEM & Internet Marketing

  10. Mike,

     Site owners who do things like offer nav through drop-down menus combined with JavaScript, on average, probably make some sort of sitemap that’s linked to with a regular hyperlink. That’s for users too, so the experience can degrade gracefully and more users can be served (traditionally including Googlebot).

  11. Wouldn’t an XML sitemap help in this particular scenario? I mean, if for whatever reason I didn’t want Googlebot crawling through my forms.

  12. Guys, I haven’t said one word about required login. I am talking about why millions of webmasters need to do extra work, which makes their sites less attractive, rather than Google finding a way to advance in a couple of areas.

  13. Mike Dammann, your question makes an admirable point. Designing for good accessibility is often the same thing as designing a site that works well in search engines, e.g. a user that comes to a dead-end and has to search to find other parts of your site may often leave your site instead of searching (possibly many times) to find your content. So providing easy linkage to all your pages can be good for both users and search engines.

     But no matter how good an idea it is to make all content accessible via simple links, some site owners won’t do that. Those are often the same site owners that put their address into an image instead of text. 🙂 So Google tries to take the web as it is and discover new ways to increase our relevance; we know that not every webmaster will provide links to all their pages.

  14. Dave (original)

    Matt, out of curiosity, has the advent of Google sitemaps resulted in Google finding a lot of new pages that they would have NOT found otherwise?

  15. Dave (original), it’s more to help prioritize between different urls that we already know about than to discover new urls.

  16. Dave (original)

    Thanks Matt.

  17. Could you explain a little more about the other part of the announcement — that Googlebot can now find URLs in JavaScript? How long has that been going on?

  18. This makes sense. Last week, on a site I administer that has a PR of 7, I discovered (accidentally) that Google had indexed some 10 redirect urls that were generated on a purchase form. I solved it by adding the landing url to robots.txt, but what amazed me is that the form action was a JavaScript function. Does Google also go through JavaScript to discover the urls, or am I missing something?

     You might want to consider adding a “noform” parameter or something in the webmaster tools to avoid having forms indexed. In my case, for instance, I don’t have any forms on the websites I maintain that should be crawled, because the forms actually gather information and redirect to a url with specific parameters. Plus, a webmaster who knows a little about SEO uses sitemaps anyway, because it’s in his interest to have the site fully crawled. Just a thought.

  19. What if I have a “Thank You” page for an opt-in form where I offer a freebie to my subscribers? If Google ends up indexing the thank you page, won’t the freebie be accessible to even the non-subscribers?

    The above is just an example. There are several other instances where it would be undesirable to let the page be accessible to those who haven’t filled the form.

    Yes, while I could stop Googlebot from accessing such pages using robots.txt, CAPTCHA, and such, why should I be expected to? I put the page behind a form for a reason.

    Forms are meant to be filled by humans, not bots. The page at the other end of the form is/may be the result of the values the HUMAN user supplies via the form.

    Sometimes a webmaster may store the results of form submission and analyze the data for trends, etc. But now that Google is letting the user bypass form submissions, webmasters can’t have this data.

    The excuse that they are “trying to improve our coverage of the web” (from the original post at the Webmaster Central Blog) is a feeble one at best.

  20. Thanks for the link to the webmaster blog post. There are some very good points in there. I agree with your points about making sites not only friendly for visitors but friendly for search engines also. It is nice to see that Google continues to develop new ways to discover information on the web…

  21. A couple of questions.

    If a sitemap has been submitted to Google then why do you not respect the fact that the webmaster has already told you what to crawl?

     Also, on a small site with a well-laid-out structure that has relatively few static pages with lots of content and a search box, I would have thought that your “random search” indexing would only serve to destroy the “balance” of the site’s structure, pushing it from highly structured to completely random. Surely this has implications for how Google “sees” the site for ranking purposes?

    Thanks in advance.

  22. Well, if nothing else, I am buoyed that you guys (correctly) considered my high-quality site as high-quality 😉

     I too noticed you were indexing from a drop-down around 3 (or possibly more) months ago. I didn’t realise it was hush-hush news, though, else I would’ve hollered for kudos value (and links). 😀

  23. @ Claudiu: Check Google Webmaster Central’s post. They are indeed running through JavaScript.

  24. Hi Matt,
     Whilst I understand the ‘spirit’ of the plan, the reality of Google filling in some ‘get’ forms and then indexing the resulting content pages may end up causing more grief than it solves, for users and site owners alike.

     1. Many dynamic, application-generated pages use cookie data so users can get more out of the site (e.g. going back to earlier searches). And for cookie rejecters (like Googlebot), many of these sites insert the cookie data into the URL.
     So Google will index the resulting dynamically generated URLs (complete with cookie data) and will never actually get the same URL twice, i.e. duplicate content, and indexed URLs which, when clicked on by users, return 404s due to expired cookie data in the URL (a possible robots.txt hedge for this is sketched at the end of this comment).

    2. The vast majority of these ‘get’ form generated ‘results’ pages have no links to them. So even if they are indexed – they will only ever potentially be returned in the SERP rankings for the very very very long long long long tail.

    No links = no rankings on any kind of moderately competitive phrase.

     3. Most sites using ‘get’ forms should/would have already built a static crawl path/Google Sitemap etc. to the content they want indexed, in a URL format that allows linking/indexing (i.e. they have already solved point 1 above)

    So Google being able to index ‘random’ content generated by the application could be counterproductive (as potentially a static URL & crawl path to the ‘selected’ content already exists) – and that just means duplicate content….

     Unless I’m misreading this, it looks like a whole lot of work for what will potentially provide a pretty poor search result for Google users, and cause grief, worry, and duplicate content issues for site owners…

    Wouldn’t it be a better use of resources to concentrate on removing MFAs from the index, than potentially bloating it with more dup content?
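
     For the cookie-data urls in point 1, one possible hedge (assuming the session data shows up as a query parameter; “sessionid” is only an example name) is a wildcard rule, which Googlebot’s robots.txt parser supports:

         User-agent: Googlebot
         Disallow: /*sessionid=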

  25. Never knew that drop-down boxes are a zero in the eyes of search engines.
     Then what would you have to say about the Flash websites full of drop-down boxes and Flash content, still ranking well in search engines?

  26. Hello Matt,

    Just a quick question..

    You mention “you can use robots.txt to block the urls that would be discovered by crawling through a form.”

    If I understand correctly, this means Google would crawl but not display the results.

     What if a website would like to disallow the bot from searching through a particular form altogether? In our situation, the search form leading to flight results is chargeable per search, so we would not want excess searches.

    Thanks!

  27. Dropdown-powered navigation is quite likely (it seems to me) to rely upon JavaScript to work.

    Would the Googlebot follow regular links in a element that JS-using browsers would not see? Such links would also serve a good accessibility purpose, whilst appeasing the “I like it looking clean” people.

  28. Matt, I have a question… I found a domain which is nearly 10+ years old as of now… Last week I was just browsing that website and noticed that it has multiple mirrors, like:
     domainname.com, domainname.net, domainname.org, domain-name.com, domain-name.net, domain-name.org (all domains are 5+ years old and have enough backlinks).
     The strange thing is that all of them have the same content on them; you could say they are parked. And all of them are getting ranked for different, and sometimes the same, keywords. Don’t you think that’s strange? Or does Google not want to give them a penalty because they are too old and have great trust on their domains with too many backlinks?
     For some reasons I can’t tell the domain name in public!

  29. Solved: another common site review problem.

     Caused: another 10 common site review problems.

     Matt, while I know no one will read this, let alone reply:

     As someone who knows Google’s selection of where to apply this ‘new’ technology is not manual (nor sane), why didn’t you make this OPT-IN?

    As mentioned on WW, Google has educated people to use forms to HIDE things they didn’t want Google to see at all. Now all of a sudden this system comes along and grabs things never meant to be indexed.

     This move is a little less justified than if, let’s say, Google from tomorrow on would crawl, index and even rate JavaScript, AND nofollow links.

     What’s your take on the damage this will cause?
     This sudden one-sided change in an unwritten policy (rule) of the web?

     And… again, please note that the sites are known not to be chosen manually. They follow trust factors for the most part; at least that’s what testing shows so far.

  30. @ Ankit – I would suggest you submit a spam report. This isn’t the place for that question 😉

     @ Chetan – the Flash sites may be ranking because they have alternative content for accessibility purposes, or they may be ranking for the domain name; your statement is a bit vague.

  31. @ rishil – I had already done that, but no action, and it’s been more than a month now!
     I also asked this question at SEOmoz and they were surprised themselves… maybe it’s OK with Google’s terms to have the same content on multiple domains if you have some strong backlinks and a lot of trust on all the domains carrying that content… just want to confirm this!

  32. @ Ankit – PM me the site url via moz.

     Regarding spam submissions – these guys get hundreds if not thousands, which means it’s a slow process, and manual. I have one report which is months old…

  33. Matt,

     This hits very close to home for us. Since Feb 14, 2008, we have seen 4 Google IP addresses enter 50,000 search terms into our internal search engine. An example IP address is 66.249.67.3. I’m not sure what to make of it, but that seems awfully drastic compared to this post. Can you shed any light on this? I can provide the log file if necessary. Thanks for your help, as we are extremely anxious to figure this out.

  34. Will there be some tags to advise a bot which forms could be of use to a crawler and which should be skipped?

  35. Just re-read my comment above and see that it’s swallowed the HTML element I was asking about. What I meant was:

    Would the Googlebot follow regular links in a [noscript] element that JS-using browsers would not see?

  36. Jason, we do have the ability to spot e.g. full urls that are in JavaScript. We’ve done that for a very long time now, I think.

    My rule of thumb is “If you can copy-and-paste a url into the address bar and get a page back, there’s a chance that a search engine could discover that url somehow (e.g. a referrer page could link back to it).” If you don’t want a particular url showing up, block it in robots.txt or add an .htaccess file to password protect that directory.
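
     For the password-protection route, here is a minimal sketch of an Apache .htaccess file (the paths are placeholders, and you’d create the password file separately with the htpasswd tool):

         AuthType Basic
         AuthName "Private area"
         AuthUserFile /home/example/.htpasswd
         Require valid-user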

  37. If you are really bent on keeping that drop-down menu but want to make it accessible (both to the search engines and to people who need assistive technology such as screen readers), you can use JavaScript to transform a regular div tag filled with links into a drop-down menu.

    I have posted the sample script at http://www.thegooglecache.com/white-hat-seo/really-solved-another-common-site-review-problem/
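
     In case that url ever goes dark, here is a rough sketch of the same idea (an illustration, not the posted script): serve a plain list of links that spiders and screen readers can follow, then let JavaScript swap it for a drop-down so everyone else gets the clean look.

         <ul id="country-nav">
           <li><a href="/us/">United States</a></li>
           <li><a href="/de/">Germany</a></li>
         </ul>
         <script type="text/javascript">
           // Replace the crawlable list with a dropdown for JS-capable browsers.
           var list = document.getElementById('country-nav');
           var links = list.getElementsByTagName('a');
           var select = document.createElement('select');
           for (var i = 0; i < links.length; i++) {
             var option = document.createElement('option');
             option.value = links[i].href;
             option.appendChild(document.createTextNode(links[i].firstChild.nodeValue));
             select.appendChild(option);
           }
           select.onchange = function () { window.location.href = this.value; };
           list.parentNode.replaceChild(select, list);
         </script>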

  38. Hello,

     Did you ever hear the German tale of the Search Engine Robot and the Web Designer? It might be difficult for many to read in German. But let’s test how good Google Translation is 😉

    in German
    http://www.woodshed.de/publikationen/dialog-robot.html

    Translation
    http://translate.google.com/translate?u=http%3A%2F%2Fwww.woodshed.de%2Fpublikationen%2Fdialog-robot.html&langpair=de%7Cen&hl=en&ie=UTF-8

    have fun!

    regards, Thomas

     PS: Would you rate the translated story readable and understandable?

  39. One last question:

     When Googlebot tries “search queries,” does it use internal search query data from Google Analytics?

    Matt’s fan

  40. I know I’m reviving the dead here, but I have a question regarding Google and forms.
     I have a client who is re-coding their site, and for some reason the developer has wrapped all the content within a form tag. Can wrapping the entire contents of your site in a form tag have a negative impact on the existing SEO? Would this affect crawl rates?

     The content and presentation are staying the same, and all the links are straightforward hrefs; the submit for the form is triggered by some new functionality.

     Example:

     Initial code:

       <h1>Heading</h1>
       <p>content</p>

     Revised code:

       <form action="..." method="post">
         <h1>Heading</h1>
         <p>content</p>
       </form>

    — Cheers
