Fall weather forecast

Yahoo! gives a nice weather report to announce an index update, and it seemed like a good time to give people an update on search quality/infrastructure at Google going into the fall. The last weather forecast I did was about a month ago, and it was on video. It's still a good video to go watch as background. Just to be crystal clear, each of the following paragraphs is talking about a different piece of infrastructure. 🙂

Bigdaddy was a software upgrade to how we crawl and partially how we index the web. It was deployed and done pretty early in the year. It brought smarter Googlebot crawling, including tricks like full gzip support and a crawl caching proxy that means less bandwidth usage for site owners.
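
To make the gzip bit concrete, here's a minimal Python sketch of what gzip-aware fetching looks like. The real crawler is of course far more involved, and the user-agent string below is made up:

    import gzip
    import io
    import urllib.request

    def fetch(url):
        # Advertise gzip support so the server can send a compressed
        # body; that compression is where the bandwidth saving comes from.
        req = urllib.request.Request(url, headers={
            "Accept-Encoding": "gzip",
            "User-Agent": "example-crawler/0.1",  # hypothetical crawler name
        })
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
            # Only decompress if the server actually sent gzip.
            if resp.headers.get("Content-Encoding") == "gzip":
                body = gzip.GzipFile(fileobj=io.BytesIO(body)).read()
            return body

    page = fetch("http://www.example.com/")
    print(len(page), "bytes after decompression")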

We used the summer to swap in a completely new architecture for Supplemental Results. The core of that infrastructure is complete and fully deployed, but I’m sure we’ll see additional smaller changes (mostly making sure that queries off the beaten path such as site: do what people expect).

I believe site: results estimates should be more accurate at any IP address you try now. In mid-summer (while I was on vacation, in fact), people noticed that sometimes site: results estimates were too high. One change went in during mid-summer to make general results estimates more accurate, especially for shorter queries, but the change didn’t really apply to site: results estimates.

Happily, there was another piece of infrastructure going out that improved general quality and also made site: results estimates more accurate. I think I mentioned in the video that those folks were shooting to be live everywhere by end-of-summer/end-of-quarter, but it was a hope, not a promise. I believe that infrastructure was turned on at all data centers by last Friday (Oct. 6, 2006), which is pretty close. Most of the other quality improvements due to this infrastructure will be pretty subtle/stable, but it’s nice that site: results estimates are more accurate now.

Let's see, what else? We just did a PageRank export, so I wouldn't expect to see another export until the new year. The infrastructure that serves up PageRank in the Google Toolbar, link: data, info: queries, and "Similar results" is also new (surprise! 🙂 ). I believe that's the only piece of infrastructure I've mentioned so far that isn't deployed at every data center, and relative to the other things I've mentioned, that infrastructure is smaller. The new infrastructure is live at about 2/3rds of data centers, and I'd expect it to roll out to all data centers within a month or two (again that's a hope, not a promise). In the meantime, you may see some differences in PageRank in the Google Toolbar depending on which data center you happen to hit.

I know that webmasters are especially sensitive to quality/webspam/ranking changes in Q4 because of the holiday season. If we’ve got something that evaluates well and that we think will improve quality, we can’t just pause for 1/4th of the year, but if anything big launches I’ll try to be available to answer questions and help get a handle on any changes. (Right now I’m not expecting radical changes in webspam ranking, but I know better than to make a promise.) Of course we’ll also be around at webmaster conferences. Several Googlers (including me) will be at PubCon in Vegas in November to talk to webmasters. Several Googlers (including Adam Lasnik and Vanessa Fox, but probably not me) will also be at SES Chicago in December to get feedback and answer questions too.

Okay, that's everything that I can think of. 🙂

37 Responses to Fall weather forecast

  1. Thanks for the update Matt. That was a nice little rundown. Most of the changes have been permanently stored in most of our brains now and most of the infrastructure changes, in my opinion, were for the better. Well, off to PubCon, my first ever conference, looking forward to hearing you…"Speak" 🙂

  2. Great summary. By the way, any chance of another grab-bag session soon? I've got some things to get off my chest 😀

  3. Matt

    That's what I call a major weather report. Thanks a bunch!

    “Several Googlers (including me) will be at PubCon in Vegas in November to talk to webmasters. Several Googlers (including Adam Lasnik and Vanessa Fox, but probably not me) will also be at SES Chicago in December to get feedback and answer questions too.”

    You folks seem to think the USA is the whole world and stay there 😉

    "The new infrastructure is live at about 2/3rds of data centers, and I'd expect it to roll out to all data centers within a month or two (again that's a hope, not a promise)."

    Great news. Congrats.

    Are those 2/3 of DCs already serving results for Google.com etc… ?

  4. Matt,

    Interesting new supplemental structure.

    But I'm not sure it will achieve what you want. For owners of genuinely template-driven sites - many broadly similar products with only subtle but vital differences - this seems to be a major hurdle.

    For example, four sites that sell paper maps - http://www.mapsworldwide.com, http://www.maps.com, http://www.stanfords.com and http://www.elstead.co.uk - have seen the vast majority of their pages go supplemental (700 pages in the main index versus 30,000 in the supplemental index for two of these).

    Yet the vast majority of supplemental pages on these sites are simply not duplicate content; they are unique but very, very similar. For example, USGS topo map 1234 *is* very different from map 1235, yet on a template-based site the differences may be small.

    Of course the right answer is to make every page “more different”. But this is a real challenge for such similar products. Is there not a better way of handling this?

    Chris

  5. Thanks Mr. Cutts.
    I had posted a screed at Breadwatch about your statement lumping in the Semantic Web with the hype of NLP and the Deep Web.

    I did not realize you are the "top cop" at Google for anti-spam measures, and must have more than enough to deal with, so you were probably just having fun…

    It is my wish that Google will embrace the Original Vision of The Web, one of a world of shared information.

    When Douglas Engelbart and colleagues gave the historic demo in 1968, it was the first public display of the mouse, hypertext, video conferencing, and more. All of this arose out of the simple concept of using computers to accelerate knowledge sharing.

    Perhaps in the pursuit of the Semantic Web, we will also see such novel and useful "quantum" developments, all in the ultimate goal of information sharing.

    > You folks seem to think the USA is the whole world and stay there 😉

    Harith, even in my first few months at Google, I traveled to London, Dublin, and Berlin, and I'm sure I'll attend other non-US conferences next year as well. But it's not all about me; we have friendly and informed Googlers around the world, and I hope and expect you'll be seeing more of them in the future 🙂

  7. Matt, a bit off topic, but with the constant updates that frankly 98% of the population do not understand (it's been 2 years now and I am finally getting the gist of how the search engines Really work):

    Has Google considered a separate forum to assist non-profits with "SEO" questions and answers?

    Some of the greatest organizations in the world are non-profit; it's the non-profit part that's difficult. With the top SEO firms charging $50-800 per hour, how can a non-profit have a fighting chance?

    Yes, I know quality content is the key, but that is one of the 100 keys non-profits cannot afford to be educated about. What about a how-to guide for non-profits? Or even better, tie a forum into the Google non-profit foundation where actual non-profits can go for assistance.

    My fear is that the people who are not motivated by money but rather by helping others will slip farther and farther into the Internet abyss due to updates and algo shifts they neither understand nor can afford to understand.

    No, I don't think any of the search engines are evil. Perhaps a few of the SEO firms seeking to prey on the uneducated are.

    Rant over

    Sam

  8. Matt,

    I am in agreement with Chris W. We also have a template-driven web site, http://www.worldwideshoppingmall.co.uk, that has many product pages now in the supplemental index. Many of these pages have unique content, but it seems as though the rules being applied to determine duplicate content are being taken too far. It appears to me that we simply can't add content to get out of this hole.

    I could really do with some guidance on the policy that Google is adopting here.

    Regards,

    Simon

    ps (I thought the video sessions were really useful)

  9. Hello Matt,

    As previously discussed, my PageRank has been fluctuating for the past few weeks or so, but I am curious when the displayed PR listed in the Google Directory will update. For example, if you go to http://www.google.com/Top/Shopping/Consumer_Electronics/N/?il=1 the left side has a 1-11 scale rating each website. At one point I was under the impression that the PR posted in the Google Directory and the Toolbar PR were directly related, but I have learned that they work independently.

    Damir

  10. "Harith, even in my first few months at Google, I traveled to London, Dublin, and Berlin, and I'm sure I'll attend other non-US conferences next year as well. But it's not all about me; we have friendly and informed Googlers around the world, and I hope and expect you'll be seeing more of them in the future :)"

    Adam, I know about your recent travels. I read about it and thought you might visit other countries not far from London and Berlin 🙂

    Talking about Googlers around the world: in fact I had a meeting today with 2 of them. They were very kind but would only discuss the AdWords deal and wouldn't talk about SEO, PageRank or DCs 🙁

    I thought... no talk about SEO, PageRank and DCs... no AdWords deal 🙂

  11. I was posting to your bug report thread, but got an error.

    The site is mainly listed at www; a search that excludes www finds a few hundred non-www results (all the URL-only stuff is excluded by robots.txt and you are in the process of dropping it), but at the end are many thousands of www Supplemental URLs even though they are meant to be excluded from the search: http://www.google.com/search?num=100&filter=0&q=site:resource-zone.com+-inurl:www&start=900
    I assume you can see beyond 1000 results internally, but using other searches I know that the rest of those results up to 24 000 have to be www supplemental results. I see the same effect on many other results, if you want those I could always mail them…
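
    For reference, this is the kind of check a robots.txt-aware crawler makes; a minimal Python sketch using the standard library parser (the bot name and URL are just examples):

        import urllib.robotparser

        # Load the site's robots.txt and ask whether a given bot may fetch
        # a given URL -- the mechanism used above to keep the URL-only
        # entries out of the crawl.
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url("http://resource-zone.com/robots.txt")
        rp.read()
        print(rp.can_fetch("Googlebot", "http://resource-zone.com/some/path"))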

  12. shoppingmallguy, your meta descriptions are far too alike across multiple pages, your URLs have waaaaaaay too many hyphens to be healthy, and the on-page UNIQUE content is too light for Google to say that each page really is different. You have a problem. The problem is on your site.

    Clicked at random: http://www.worldwideshoppingmall.co.uk/potteryshop/cream-hand-brushed-gold-highlights-5-bulb-light-fitting-088-5cr.asp

    Where is your site-wide link back to the root index page? I don’t see one.

    Breadcrumb navigation would do wonders for your site. Where is it?

    Read all of Matt's posts here, and then find a friendly SEO forum and get reading about site architecture.

  13. Supplemental Challenged

    Thanks for the update Matt, even if this is one of the nuttiest things you have ever written:

    “It brought smarter Googlebot crawling, including tricks like full gzip support and a crawl caching proxy that means less bandwidth usage for site owners.”

    I've got 72 pages on a PR5 site, with a couple hundred external links from PR6 to PR3 pages, with at least one external PR4 link to every internal page (plus internal links), and the far, far stupider new Googlebot can't manage to fully index this site. "Less bandwidth usage" concerns virtually no one except spam sites when compared to "index my website" concerns.

    It would be much more comforting to read that the forecast included correcting the abysmal failure of the new crawl priorities.

    The crawl priorities lead to all these PR1 free hosts and blog redirects getting refreshed every day, while quality niche sites with good PR and scores of links don't get crawled at all.

  14. Dave Davis, maybe I'll see you at the Pubcon. 🙂

    Ian, I’m a little tired from doing the 25 or so over in the “smaller issues” thread, so it may be a little while yet.

    Harith, we do try to get international feedback wherever we can. I just found a good source of German comments, for example. Yes, google.com usually hits the newer infrastructure that serves up PageRank results.

    Chris W, PageRank is the primary factor determining whether a url is in the main web index vs. the supplemental results, so I’d concentrate on good backlinks more than worrying about varying page layouts, etc.

    Alex Piner, that’s a whole ‘nother interesting conversation. I’m all in favor of a semantic web, but the trick (to me) is how to get people to contribute to it. On the one hand, stuff like XML and FOAF make my eyes glaze over, and I’m a pretty technical person. On the other hand, wikipedia.org clearly found a solution to letting people contribute.

    sam, good points. I certainly spend time deliberately thinking about how to help sites like that (i.e. ones not aware of or savvy about search engines) in the rankings. I agree that I often write for the expert side of the spectrum, when arguably it may be a bigger benefit to concentrate on the basic/meat-and-potatoes subjects rather than discussing lower-impact but expert stuff.

    g1smd, in general when we recrawl supplemental pages, that's when we notice updates like 301s, etc. Oh wait, I think I misunderstood, you're talking about "-www" queries still returning www results in supplemental. Yah, that's a known issue I mentioned in my "Smaller issues" post: "[site:mattcutts.com -inurl:blog -inurl:files] returns a result or two with 'blog' in the url from Supplemental results." I pinged someone about it today though.

    S.C., you left no specifics I could debug or pass on.

  15. Matt, not just -inurl: queries, but site:domain.com inurl:faq returned the FAQ pages, and then that was followed by a load of entries for URLs that are now redirected, and for URLs that are now 404, and all were, of course, tagged as Supplemental. Those had “leaked” into the results too.

  16. Matt, not just -inurl: queries, but some others too. For example, site:domain.com inurl:faq returned the FAQ pages as expected, and then those were followed by a load of entries for URLs that are now redirected, and for URLs that are now 404, and all were, of course, tagged as Supplemental. Those URLs did not have "faq" anywhere within the URL, but they had still "leaked" into the results too.

  17. This may be just me being paranoid, and I’m sorry to call you out on something, Matt, but there’s a comment you made that I find somewhat worrisome:

    Chris W, PageRank is the primary factor determining whether a url is in the main web index vs. the supplemental results, so I'd concentrate on good backlinks more than worrying about varying page layouts, etc.

    True as it may be (and I believe that it is), maybe a little more of a qualifier is needed, in the sense that people could take the comment the wrong way and we could have a whole new batch of PageRank-obsessed people out there trying to get backlinks to avoid "supplemental hell", without any regard for site architecture, clean code, and all the other good design stuff that can and should contribute as well.

  18. Hey Matt.

    Can you tell us what's up with the recent +30 position penalty that seems to have hit many webmasters with old sites? Is this manually applied or an algorithmic change?

    It seems that a lot of people who ranked #1 now rank #31, not only for their target keyword phrase but also for their branded domain name.

    http://forums.digitalpoint.com/showthread.php?p=1562478&highlight=fourth+page#post1562478
    http://www.webmasterworld.com/google/3087394.htm
    http://www.webmasterworld.com/google/3001677.htm

    Some advice would be nice, since the Google Sitemaps webmaster tool does not show any penalties.

  19. Hello,

    I don’t really think I should be posting this here but I can’t find anything else suitable and was hoping you could shed some light on it for me.

    I found this tool I've never seen before on Google. Is it new? I can't get it to show up anywhere else.

    http://www.google.co.uk/search?hl=en&safe=off&q=jobs+in+norfolk&btnG=Search&meta=

  20. A question on Supplemental Results when doing a site: query: why, when doing a site: query listing all results on DC 216.239.63.99, do I not get the index page until the 14th page of supplementals, even though the index page is not supplemental? The remaining pages are all supplemental except for the last 3 listings. Is this a bug, or when doing site: queries are you supposed to get this weird mix?

  21. The Few, the Proud, the PubCons. Glad to hear you’ll be in Las Vegas!
    Bring Adam Sah along – he’s got great ideas for Google helping small businesses.

  22. Matt - there's a bit of a discussion going on over at WMW http://www.webmasterworld.com/google/3110528.htm with folks trying to get an understanding of what types of filters are applied and for what reason. The opening proposal was that the term "Sandbox" is too broad a term for a potentially wide range of filters, and you can see our progress. Several blogs have picked up on the thread.

    We've kinda run out of ideas, except for the obvious ones involving duplicate content. Many of the comments are speculation, albeit potentially well founded.

    Can you give the good folks over at WMW a bit of insight into the major filters, factors and outcomes to help them better manage their sites out of restrictive filters?

    I, for one, am struggling to get some sites lifted from badly filtered results, one of which previously ranked well.

  23. g1smd, I think that boils down to issues we’ve seen though (inurl: with supplemental isn’t perfect yet, and pages that have changed or gone away will show up until we refetch those pages and see that they’ve moved or disappeared).

    Well said, Multi-worded Adam. The last thing that I’d want is for people to obsess about PageRank when they shouldn’t.

    Mister P, I believe this is scoring the quality of those documents; even the DP thread didn’t seem to mention any specific sites.

    Niall O’M, that’s a Onebox interface to Google Base. [chicken recipes] is an example of a different Onebox interface to Google Base.

    sherrillh, I don’t remember hearing about that; is it still happening? The domain you left from your email address didn’t appear to have that behavior.

  24. Hey Matt.

    Regarding the top-30 penalty: here are two affected sites:
    1st-for-french-property.co.uk and hotelmotelnow.com
    Any thoughts ?

  25. Matt,

    Of course not :) I do have a screenshot if you are interested, but it seems to look as it should now. However, I do have another site where the site: query lists all the Supplemental results first and then on page 74 starts listing the non-Supplemental index page and others. I listed this one in the URI.

    Thanks for looking!

  26. Heh, sherrillh: is that without &filter=0 on the search URL?

    I see something like that with one site. This happens where Supplemental Results appear first, but as soon as I add &filter=0 to the search URL, a whole load of results that were being hidden behind the “Click for omitted results” link now jump ahead of all the Supplemental Results.

    Can you check to see if you get that effect too?

    P.S. I always search with &num=100 on the search URL to get 100 results per page. That saves a lot of clicking through results pages.
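
    For anyone who wants to script this, here's a small Python sketch that builds such a URL from the parameters described above (the query is just an example):

        from urllib.parse import urlencode

        def search_url(query, num=100, unfiltered=True):
            # num=100 gives 100 results per page; filter=0 stops Google
            # from hiding near-duplicates behind "omitted results".
            params = {"q": query, "num": num}
            if unfiltered:
                params["filter"] = 0
            return "http://www.google.com/search?" + urlencode(params)

        print(search_url("site:resource-zone.com -inurl:www"))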

  27. g1smd, I just tried without the filter and still get the same Supplementals at the top and non-Supplementals at the bottom of the results.

  28. I’m unable to use the url removal tool because it says there is already an account for my email address. When I try to login, I get a “Login Error”. Is this a true error or is someone else using my account?

  29. Simple question, Matt: how does one get a manually imposed penalty removed? I.e. the minus-30 penalty, for one.

    please comment

  30. Mike –
    You have to create a NEW account for the URL removal tool. One of my pet peeves is all the different accounts you have to have to use the various Google services.

  31. So... this minus-30 penalty is a big topic online right now. What does one have to do to have it removed?

  32. When you do a search and there is a suggested second search, the syntax is:

    Did you mean:

    Don’t we usually use a “?” as punctuation for a question? (or is it “:” )

  33. Mister P:

    Thanks for mentioning http://hotelmotelNOW.com/ and my 30-SERP penalty, which I have incurred for 10+ months. Although Matt didn't respond, I think my old programmer and I finally found and fixed the problem yesterday.

    The programming firm I hired last fall was not as qualified as I had been led to believe. They really failed at error reporting. When a hotel was no longer in the database, they used first a 302 and then a 301 when they should have used a 410.

    It gets even worse. Some of the 301s were resulting in a loop. In fact, we had to shut down the server to clear it out. Anyway, my old programmer is back and he fixed the problem in 10 minutes.

    What the bad error reporting probably triggered was sneaky redirects and perhaps even suspicion of cloaking. Neither is true, but I can see where the algo was tripped.

    This morning I filed a reinclusion request explaining the problem. As soon as the reinclusion team looks at it, I feel confident that Google will remove the penalty (if it is not removed automatically via the algo), because Google always liked hotelmotelnow.com before I hired those programmers. Thank God my old programmer is back!
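
    For anyone hitting the same thing, here's a minimal Python sketch of the idea behind the fix: answer 410 Gone for listings that were removed on purpose instead of redirecting. The paths and hotel IDs are made up:

        from http.server import BaseHTTPRequestHandler, HTTPServer

        # Hypothetical IDs of hotels deleted from the database.
        REMOVED_IDS = {"1234", "5678"}

        class HotelHandler(BaseHTTPRequestHandler):
            def do_GET(self):
                parts = self.path.strip("/").split("/")
                if len(parts) == 2 and parts[0] == "hotel" and parts[1] in REMOVED_IDS:
                    # 410 tells crawlers the page is gone for good --
                    # no 301/302 chains and no chance of a redirect loop.
                    self.send_response(410)
                    self.end_headers()
                    self.wfile.write(b"This listing has been removed.")
                    return
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"Hotel page")

        if __name__ == "__main__":
            HTTPServer(("localhost", 8000), HotelHandler).serve_forever()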

    Jim

  34. Matt,

    Thanks again for taking the time to write your blog.

    Could you please address the alleged 30 serp penalty that I have been reading so much about and appear to be suffering from as well? I think you could clear up some massive confusion for a good number of webmasters.

    Jesse

  35. I’m glad the PR export is finally done, Matt. I’m wondering if anyone else experienced a slaughterhouse effect on their back pages on their way to a higher PageRank.

  36. Don,

    I've recovered my site from the minus-30 penalty. I've started a blog with a step-by-step approach for webmasters at http://minus30.com.

    It's not as hard or as ambiguous as it may seem. Google's webmaster tools are basically an instruction guide to what's important to Google: no HTTP errors, keyword density, a sitemap, etc…

    I’m open to feedback and user submissions.

  37. I'm wondering if anyone else experienced a slaughterhouse effect on their back pages on their way to a higher PageRank.
