A quick word about cloaking

(Philipp Lenssen and I started talking about cloaking in a corner of the web, and I figured it would make sense to talk about cloaking in a separate post. Consider this a me-typing-this-quickly post, but better to get something down than to not get a chance to talk about it.)

Cloaking is serving different content to users than to search engines. It’s interesting that you don’t see all that much cloaking to deliver spam these days. If you see people doing spam, they tend to rely on sneaky redirects (often via JavaScript) more than cloaking. For example, a blackhat might make a doorway or keyword-stuffed/gibberish page plus something like a JavaScript redirect that sends the visitor to a completely different page.
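
To make that concrete, here’s a minimal sketch of the kind of doorway-page JavaScript redirect I mean (the page and target URL are made up for illustration). A crawler that doesn’t execute JavaScript indexes the keyword-stuffed page; a browser gets bounced somewhere else entirely:

```typescript
// Hypothetical illustration of a "sneaky redirect" on a doorway page.
// The keyword-stuffed HTML is what a non-JS crawler indexes; a human
// visitor who loads the page in a browser is immediately sent elsewhere.
function sneakyRedirect(): void {
  // The target URL is a made-up example.
  const target = "http://example.com/completely-different-page";

  // replace() keeps the doorway page out of the back-button history,
  // which makes the redirect harder for a casual user to notice.
  window.location.replace(target);
}

// Run after load, so the crawlable content is still in the HTML source.
window.addEventListener("load", sneakyRedirect);
```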

Here’s the recent timeline of Philipp Lenssen talking about cloaking and WebmasterWorld (WMW) as I see it:
– Philipp wrote this post in late November: http://blog.outer-court.com/archive/2006-11-28-n23.html
The basic point was that if you searched for [php-based cms] and clicked on the #1 result (which was WMW), you would get a registration page rather than the page that Googlebot saw.
– I didn’t have the cycles to deal with it right then, but earlier this year I made it clear that WMW would be removed if it met the definition of cloaking when I tested it.
– I believe the administrator (Brett) of WMW made code changes to the site so that WMW would not be considered cloaking.
– I recently tested with the example Philipp originally mentioned. I did a search for [php-based cms], clicked on the #1 result, and got the same page that Googlebot saw with no registration page.

Those code changes address many of the concerns I’ve heard regarding WMW (that users who click on the results don’t get what Googlebot crawled).

So I consider Philipp’s November article acted on. Philipp’s December post about it (http://blog.outer-court.com/archive/2006-12-13-n85.html) cited the same search for [php-based cms], so I consider that acted on as well.

I believe that takes the timeline up to February. I’m aware of two other posts Philipp did on this topic, both in February. The first is http://blog.outer-court.com/archive/2007-02-05.html which says that if you go to http://www.webmasterworld.com/forum44/1287.htm you get redirected to a registration page. When I tried it just now, I got the actual page with a “Welcome to WebmasterWorld Guest from (ip address)”. The question I’d look at for this report is: if you typed this URL into Google and then clicked on the result, would you receive the same content that Googlebot saw?

The other article that I’m aware of is http://blog.outer-court.com/archive/2007-02-20-n47.html where Philipp states that sometimes Google allows sites to go against our webmaster practices. But that statement includes an asterisk; the disclaimer at the bottom of that post is “WebmasterWorld doesn’t always show the registration page; they sometimes show the content that was available in the snippet.” As I understand it, that disclaimer acknowledges that some of the time, WMW gives users what Googlebot crawled. When I get a chance to tackle Philipp’s most recent report, I’ll be looking at consistency: when a Google user clicks on a search result at Google, they should always see the same page that Googlebot saw. It will take me a little time to check out, because it’s a report of behavior that often meets our guidelines (e.g. cookies, referrers, IP addresses might all come into play), but I do intend to investigate this issue when I get the cycles. I won’t consider this issue closed until I have the time to investigate how consistently the return-the-same-content-as-Googlebot-saw behavior happens; it should happen for every click from a Google search result.

To sum up, we did take action on Philipp’s questions about WMW. I consider the issue in a much better state now, in that most (all?) Google searchers get a page identical to what Googlebot saw. But I still consider Philipp’s February posts open for investigation, and I will get to them, in the same way that I tackled Philipp’s first two posts about this.

268 Responses to A quick word about cloaking

  1. Just wondering if I got the sum right that was used for SPAM protection 🙂

    Just to let you know that I am not an SEO, but I run a personal website related to my profession.
    Actually, I have a problem with some of my webpages.
    Can you please let me know if Googlebot can actually run JavaScript? I saw a few forum posts that suggest URLs can be extracted from the JS code, but the problem with my webpage suggests that Googlebot can actually execute the JavaScript code.

    I used JavaScript code to solve the orphan page problem of my framed website. [ as suggested in http://www.netmechanic.com/news/vol5/javascript_no7.htm ]

    Within the last week, I found that my high-ranking website on SAP (ERP) has lost its rank for many keywords. But that’s not the problem I want to discuss here.

    One of my website pages, http://www.geocities.com/rmtiwari/Resources/Management/ASAP_Links.html, ranks number one (and still does) for the term ‘ASAP Methodology’.

    However, from today onwards the title of the webpage has changed to the title of main.html (not ASAP_Links.html), the page used in the JavaScript to solve the orphan page issue. That seems to suggest Googlebot is actually executing the JavaScript, and all my pages will eventually end up having the same title.
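
    (For context, the orphan-page fix the linked NetMechanic article describes is usually a small script on each content page that reloads the site’s frameset when the page is opened outside its frame. A rough sketch of that pattern, with main.html standing in for whatever frameset page the site actually uses:)

    ```typescript
    // Rough sketch of the classic "orphan page" fix for framed sites:
    // if this content page was loaded outside its frameset, load the
    // frameset instead. "main.html" is a stand-in for the real frameset.
    if (window.top === window.self) {
      // Not inside a frame, so jump to the frameset page.
      window.location.replace("main.html");
    }
    ```

    (A crawler that executes, or simply extracts URLs from, a script like this would end up at main.html, which could explain the title change described above.)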

    Please suggest.

    Thanks,
    Ram

  2. Matt

    Thanks for detailed feedback on the subject.

    However, when I read comments from Philipp Lenssen like the ones below, I just wonder whether he is singling out WMW for discussions related to cloaking or for “other reasons” 😉

    “It’s *really* unfortunate that anyone (especially search engine representatives) continues to post to WebmasterWorld instead of an open forum.
    Webmasterworld.com is easily the most deficient forum site I’ve encountered.”

    “That depends…do you sponsor conferences and have google employees speak at them? If so you can do anything you want…if not, then you’ll have to play by the rules.”

    “Androw, make sure you follow the precise technical implementation of WebmasterWorld. This should be safe, because WMW isn’t banned.”

  3. Since I do a lot of AdSense related searches, I know that any entry in the AdSense forum is not viewable unless you are logged in, regardless of IP. I was looking up something I wrote there on smart pricing while I was in London at SES last month, and I got rerouted to the login page. And I just checked it now from home, and same thing. The AdSense forum has done that for a couple years at least.

    http://www.google.com/search?hl=en&q=adsense+smart+pricing+update&btnG=Search&meta=

    The YPN forum is the same; I don’t frequent the others enough to know.

    However, I suspect Brett will change that query to give the true page as soon as he sees this 😉

  4. Well… there is a ton of evidence of this going on. Is this investigation going to close when WebmasterWorld has had enough time to fix it?

  5. Matt-

    Great post – what about people who use counter scripts to gain first-page rankings for impossible keywords? Mesolink.com comes to mind, as they pretty much own the mesothelioma space and have done so by placing thousands of counters on unrelated websites and .edu’s, thus gaining the #1 position in Google – even over the national cancer pages for words like mesothelioma, etc. All they are is a lawyer looking for and reselling leads – the site is hardly even maintained.

    Websites that have used methods like this should be looked into.

  6. OK, so how is this query not cloaking?

    http://www.google.com/search?hl=en&client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&hs=cra&q=new+york+times+january+2006&btnG=Search

    screen shot in case you see something different

    http://farm1.static.flickr.com/152/409090766_b136513bef_o.png

    I don’t see any of the text from the snippet “Go get yourself fitted for a new chain-mail vest. A bevy of experiments in” on the page I see. Since they have caching turned off (a good idea when cloaking), I can’t see exactly what you saw, but your snippet tells me what you did see at one point in time.

    Here’s the landing page I get:

    http://select.nytimes.com/gst/abstract.html?res=F10616F73D540C718CDDA80894DF404482

  7. Harith, those comments cited above, except the last one, aren’t by me. The last one, which is by me, is very simple reasoning: if site A does something and isn’t banned, then you can do the same if you follow site A’s process, because Google uses the same guidelines and algorithms to ban or not ban sites (hopefully). For the record, I think WMW is a cool resource, I’ve been registered for a long time, and I don’t have any personal “history” with the site or its owner, as you seem to imply.

    Matt, as for your explanations, I think we still have some way to go to solve this. I don’t know if WMW changed something since my first post — it’s quite possible — but what I now often get is this behavior: the first click on the site shows the content, and if I click on it a second time from Google, I see the registration page. However, I can’t consistently reproduce this either. I consider the behavior when I am forwarded to a registration page a “sneaky redirect” because I’m not getting to the content that the snippet presented, and “sneaky redirects” aren’t permitted per Google’s webmaster guidelines. If WMW changed their code in response to the discussion we had, mostly showing the real content now, then I think that’s a great step in the right direction, although they haven’t gotten rid of all the “sneaky redirects” yet.

  8. First, thanks to Matt for the heads up. As most know, WMW has been the target of extreme amounts of bot activity over the years and has taken proactive steps to fight it. Our first and foremost job is to the regular members of the site. I have done everything I can think of to stop the bots from the cable and DSL ISPs. I have even gone so far as to ban entire TLDs some of the time (China, Russia) where heavy botnet activity exists.

    I think we have finally found a system that everyone can live with and keeps the content open for everyone, but slows down the bots. I say slows down, because there is no way to stop them.

    Thanks Matt
    bt

  9. Matt, are you familiar with this thread at the SEW forum regarding nytimes.com? At the beginning of the discussion, Danny Sullivan stated that he believed they were cloaking. By the end, he’d changed his mind about it.

    In my opinion, it’s cloaking.

  10. Philipp, yes, it has changed since your first post. I am not going to detail it, since clearly some of the botnet operators are prime readers right here. I actually built the new system to Matt’s specs. I don’t blame anyone for being upset that their ISP was on a required login – I also get miffed when I click through to the NY Times and am forced to log in.

    On the other hand, I am bothered that some of the criticism of those policies always comes from advertising program supporters and operators. They have no understanding of the risks and problems associated with trying to run an essentially free subscription site. Go to Copyscape and put in webmasterworld. Thousands upon thousands of messages ripped off. Entire sites exist just to rip off members (even GoogleGuy and MSNDude). How can you possibly compare a site that lives in that environment to a normal advertising-based site where traffic is everything? It is a completely different proposition with different needs and requirements that aren’t based in traffic for traffic’s sake.

    Philipp, if you want a tour around the back end to see how it works, I would be more than happy to show it to you, provided you published only aggregated data on it rather than specifics.

    bt

  11. Harith, I honestly think that Philipp just wanted to make sure that Google’s quality guidelines are interpreted fairly and consistently, and he’s right to want that.

    Philipp, I made the edit for you and deleted the extra comment. As I noted in the post, I believe WMW did make code changes since your first post to deliver to users what Googlebot crawled. And I do consider your February posts still open and will investigate them in more depth.

  12. What bugs me about WMW is how some people see the login pages sometimes and others don’t. It feels like they’re just playing games with the users. If it only displayed the login page 10% of the time, it would be almost impossible to prove. What’s wrong with just doing it right? Considering they have a whole forum on cloaking, I doubt they’re going to drop it completely… 🙁

    What is the official take on sites like the NY Times which show the full article to Googlebot but only the first section (snippet) to the non-registered user? It’s easy to tell that Googlebot is getting different content (from Google’s snippet). Is that close enough?

    Example: [Attacking online China]
    Shows a different snippet in the search results than in the abstract-page that solicits buying the full article or a subscription.

    Note: I don’t want to single out any site, it would just be great to have a clear guideline :-).

    How much cloaking is ok? Is removing the “filler” navigation/sidebar ok? How about removing ads? Javascript? Session-IDs?

  13. I think I jumped into some kind of personal and rather historical discussion. Sorry about that.

    Actually, I tried to post my question in one of the forums, and incidentally, I found WMW through a Google search.

    WMW seems to be a big forum [ almost sounds like a BMW ;-)] but they charge for the membership, it seems. Anyway, I will try my luck on other forums.

  14. I agree with Philipp on the whole thing with WMW being pretty sneaky, and I have always had confidence that it would be looked into.

    I am writing the guideline descriptions, but I am a bit stuck on defining and explaining to the general new webmaster what a “sneaky redirect” is. I know there are many examples, but if you, Matt, or anyone reading this knows of a good article or someplace Google has defined “sneaky,” I would appreciate it if you could pass that info to me.

    I get a lot of questions about it.

    I see “cloaking” as well defined, but in all honesty the best example I have yet seen of something being “sneaky” is WMW’s redirect to that login screen. I will not use that in my Feedthebot description, but it does seem to be sneaky.

    Let me know about anything that would help me define “sneaky redirects”!
    pat, guidelineguy@gmail.com

  15. In my opinion, no amount of cloaking should be okay. As a user, I don’t want to see something juicy on the search results just to have to hit the back button as soon as I hit the page.

    It would probably be expensive, but it seems like Google should have two bots: one Googlebot and one that identifies itself as IE. Then the two results could be compared, and the sites that don’t match could be easily dropped. Like I said, it’s expensive and would take up bandwidth, but it would clear the whole problem up in one shot.
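
    (A minimal sketch of the comparison being proposed, for Node 18+ — the URL and user-agent strings are only illustrative, and a naive byte-for-byte comparison like this would false-positive on any page with dynamic elements such as counters or rotating tips:)

    ```typescript
    // Fetch the same URL twice, once as a search-engine bot and once as a
    // regular browser, and flag the page if the responses differ.
    async function looksCloaked(url: string): Promise<boolean> {
      const asBot = await fetch(url, {
        headers: { "User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)" },
      });
      const asUser = await fetch(url, {
        headers: { "User-Agent": "Mozilla/5.0 (Windows; MSIE 7.0)" },
      });
      const [botBody, userBody] = await Promise.all([asBot.text(), asUser.text()]);

      // Strict inequality is only a starting point; a real detector would
      // need a similarity threshold to tolerate legitimately dynamic pages.
      return botBody !== userBody;
    }

    looksCloaked("http://example.com/").then((flagged) =>
      console.log(flagged ? "content differs between fetches" : "content matches"),
    );
    ```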

  16. The reason that WMW cloaks is pretty clear – money.

    It is simply playing on the fact that many newbies get confused between the free membership and the paid membership and cough up the dough.

    It is as cynical and manipulative as any other form of cloaking and as such needs treating in the same way.

    FWIW, I am neither pro- nor anti-cloaking. I am anti-hypocrisy.
    In a perfect world, Google would have an ‘intent’ detector, and in that perfect world WMW would be banned.

  17. I still occasionally get the registration screen when clicking on links to the WMW discussions. The system has a way to go but there has been improvement.

    It would still be nice if Google would do something about the academic paper archives that cloak like spammers on speed.

  18. Springer does the same thing with their journal site(s).

    I was searching for a paper by Feigenbaum, “Quantitative Universality for a Class of Nonlinear Transformations”, and popped the title into Google.

    The top result is always:

    > [PDF] Quantitative universality for a class of nonlinear transformations
    > File Format: PDF/Adobe Acrobat
    > Quantitative Universality for a Class of. Nonlinear Transformations. Mitchell J.
    > Feigenbaum. Received October 31, 1977. A large class of recursion relations …
    > http://www.springerlink.com/index/Q555G62245040133.pdf – Similar pages

    Clicking through, however, you’re redirected to http://www.springerlink.com/content/q555g62245040133/

    … which is just an abstract. The paper itself is behind a paywall.

    I run into this every time I Google up a paper that’s held by Springer.

  19. Please don’t ban this site, but I have the same problem with this site – http://mail-archives.apache.org/ – the content listed in Google versus the content found after clicking on the link. It is a forum, and it’s very frustrating when you think you have found the post you want in Google but can’t find it after clicking on the link without navigating down through the forum.

  20. We did detail a few bits about the new system in a post:
    http://www.webmasterworld.com/webmasterworld/3270248.htm
    The time element is probably why you had to log in, Michael. We have dialed that down to as short as I feel is OK – but you still have bots that slow-crawl at a page view every 2-3 minutes – so it still isn’t perfect.

  21. > Example: [Attacking online China]
    > Shows a different snippet in the search results than
    > in the abstract-page that solicits buying the full article
    > or a subscription.

    Google shows this URL (shortened):
    http://www.nytimes.com/2007/02/05/technology/05marx...

    However, you’ll be redirected to this URL:
    select.nytimes.com/gst/abstract.html?res=F00…

    And interestingly enough, the NYT does not allow Google to show a cached copy of the page. Utills once said ( http://blog.outer-court.com/forum/40604.html#id40612 ) that the NYT “fades out” pages, meaning they’re live for everyone for a while — human visitor and Googlebot alike (no cloaking) — and then after a while they start redirecting to a registration page. I don’t know if that’s the case, or, if it is, whether or not that fits the “sneaky redirect” definition (while I don’t like this behavior, I think a website has the right to change page behavior over time, and at least the old page content will ultimately be “faded out” of Google’s index too). But I’m curious to hear Google’s — e.g. Matt’s — statement on this, because webmasters need transparency on these issues.

  22. Aren’t there existing agreements between Google and the NYT that allow the NYT to do that? Which I always thought surprising given the NYT’s commitment to transparency.

  23. Hi Matt. What about multivariate testing? Would G consider this a form of cloaking? By the strictest definition, I suspect that they would. How would you suggest we go about performing such testing without offending the almighty G? Thanks in advance for your time 🙂

    Bentley007

  24. Congratulations, Matt, on this open discourse! Scrapers stealing your content sucks, but the playing field has to be the same for everybody. Not everybody is going to be as ethical as Brett; some will use cloaking to generate their own mailing lists or, worse yet, to sell subscriptions based on the free snippet provided by Google. I for one wouldn’t mind seeing a little (subscription required) next to a result if that were true. But then the question becomes: where should those rank? With the regular free results, or on their own? Or only if free resources are low? Tough questions, but that’s why you get the big bucks.

    Success has a price, and perhaps that price is deciding whether you want to be in Google’s results or ban all bots and stop scrapers. Of course, the price of the latter is that someone will find a way to steal the content, and they’ll rank where WMW should have.

    My only request is that ALL sites be treated the same. Google’s always prided themselves on no one being able to ‘buy’ their way to the top or into the index, and money isn’t the only currency. So with a move like this post, you are enforcing that directive even more.

    Oh, and hats off to Philipp for hanging tough on this and getting results, as well as to Matt for giving the results that we’ve all been waiting for.

  25. Matt, I understand cloaking is bad and that sneaky redirects are strictly no-questions-asked ban behaviour. Good for wmw that they were given an opportunity to redesign the forum to Google’s specs before any action was taken. Perhaps such an important resource deserves it…

    What would you recommend to more mortal webmasters with sites a lot less important than WMW, whose servers got hacked and websites banned because listings in Google were — unknown to the owner — being cloaked and sneakily redirected to some other site?

    Reinclusion requests are not being responded to.

  26. Matt,

    Isn’t this also utilized for valid services? I’ve noticed this method used by sites that are heavily Flash… they push their content through some type of server-side scripting to both a page and a Flash file, then they use JavaScript to replace the content client-side.

    Regards,
    Doug

  27. What peeves me are all the spam blogs on blogger that do the same types of things. I follow a Google blog search link to a blogger blog and I either get redirected completely or I get fed something totally different (via javascript) and there’s no report button up top to click on.

  28. I notice a bunch of cloaking software sites with graybar PageRank out there; looks like I will have to prune a few folks and links from my blog to break negative link & keyword associations.

    Nothing personal guys, seen that “Sorry but I am going to have to block you” commercial?

    ;-o

  29. I fail to see why all sites have to be ‘treated equally’. Google’s intent is to stop spam, not have a web love-in. The only reason to treat sites equally is if they can’t be bothered to hand review stuff or look at exceptions. And I certainly don’t want that.

    Some sites will have valid reasons for cloaking. Perhaps WMW has a valid point for example – I know if I was paying a ton of cash to feed the bots each month, and if cloaking would fix that, then I’d be cloaking too. And at that point, I’d probably appreciate Google being able to listen to reason on the odd case instead of the rubber stamp of ‘all sites must be treated equal’.

    That being said, if the NYT’s intent isn’t cloaking for cash, then perhaps some type of captcha would prove their sincerity more than a registration screen.

  30. The true page that visitors see IS being presented. The URL I click on in the Google SERPs is *exactly* the precise one I get to see, every single time.

    I spent a good 4 years with as much on-site time at WebmasterWorld as just about anyone else around, and I clearly recall times when the site would slow to a crawl because of rogue bot activity.

    Any useful, popular site with a serious enough problem to negatively impact their entire user base would have to be NUTS not to exclude unauthorized access.

    http://www.google.com/support/webmasters/bin/answer.py?answer=35769

    “Don’t use unauthorized computer programs to submit pages, check rankings, etc. Such programs consume computing resources and violate our Terms of Service. Google does not recommend the use of products such as WebPosition Gold™ that send automatic or programmatic queries to Google.”

    It’s like the difference between practicing curative and preventive medicine. Unfortunately, there are “diseases” that affect websites that will ravage them unless preventive measures are taken beforehand. Like captchas and requiring registering to control access and damage by spammers, scrapers and thieves.

  31. Glad to see that this issue is finally getting resolved (I also wrote a long article about it back in November). What if many sites started doing this kind of cloaking/login screen? It would be terrible for the Web in general. Thanks for fixing it.

  32. It appears to be the second click while at WMW that now imposes the login form; the third if you have referrers disabled. This is true even if you have the same browser window open and just click > back > click > back > click.

  33. Here’s another sample:
    http://www.google.com/search?q=site%3Awebmasterworld.com+php

    Click on a link, then back, then a different link, and back… the third or fourth link I try, linked from Google, requires login.

  34. Matt: The subject of your post caught my eye. I have always wanted to discuss cloaking but not in the same context as you are discussing with WMW.

    I want to describe an example of cloaking that I think *should* be okay, and I want to get your take on if it is or is not, and if not why not.

    Let’s say you have a “professional” blog with content on the page, but each permalink page does a lot of cross-promoting of other posts and categories, etc. There are also all the advertisements on the page. And affiliate promotions. And so on.

    So looking at the markup, the page content might comprise less than half of the total markup. That markup is of course there to market additional content of potential interest to the reader, but in reality it is not really part of the page’s content.

    On the other hand, when Googlebot visits, the blog serves a very clean version of the page with only the content; no ads, no cross-marketing links, no affiliate links, no footers, etc. This way Google can actually see what the page is about and not have all the cruft to wade through. It would be a lot like the information a “Print This” link would display.

    So is this *bad* cloaking, and if so, why? Also, if so, is there not a way that a site owner can provide hints to Google’s indexer — like recognizing that what is contained within a DIV tag with an attribute of @id=”content” is the page’s content and the rest should be ignored for indexing, much as AdSense allows something similar?

    P.S. My blogs currently DO NOT do what I asked about so don’t go trying to ban them or anything! 😉

  35. Matt – Isn’t this along the lines of the First Click Free program? I pushed for it back at my last company, because they were whitelisting Googlebot so that all articles would be indexed, but if a user clicked on an article from a SERP they would be redirected to a login page. But with First Click Free, the document stated that we should continue to allow Google to view the content based on the User-Agent, and that if a user clicks on a link from Google, that page needs to be provided to the user without registering. If it’s a multi-page article, the user needs to be given all pages of the article. If they navigate anywhere else, a login prompt can be thrown at them.
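
    (A rough sketch of the access decision being described; the checks are deliberately simplified — trusting the User-Agent header alone is spoofable, and a real system would verify the crawler by IP or reverse DNS — and all names are made up:)

    ```typescript
    // Simplified "first click free"-style decision, as described above:
    // Googlebot gets the full article, a visitor arriving straight from a
    // Google results page gets the full article for that click, and any
    // further navigation can require login/registration.
    type Access = "full-article" | "login-page";

    function decideAccess(userAgent: string, referrer: string, loggedIn: boolean): Access {
      if (loggedIn) return "full-article";

      // NOTE: a real implementation must not trust the User-Agent alone.
      if (userAgent.includes("Googlebot")) return "full-article";

      // A click from a Google SERP gets the content it was promised.
      if (referrer.startsWith("http://www.google.")) return "full-article";

      return "login-page";
    }

    console.log(decideAccess("Mozilla/5.0", "http://www.google.com/search?q=test", false));
    ```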

  36. Vic, that is exactly what we have implemented. (see: http://www.webmasterworld.com/webmasterworld/3270248.htm ) After that, only a selected set of ISPs are required to log in, and only those ISPs that have had a history of bot abuse. I have been debating detailing it entirely, but how do I do that and not expose the system to tenfold the abuse by letting bot runners know exactly the weaknesses in the system? When some of the abuse is (imho) by competitors, how can you detail what is effectively a security matter without making the system moot? That is the reason we don’t pull a Google and throw up a “you look like a bot” page and then require them to log in – they’d know what triggered the block action.

  37. We faced the same dilemma, because in theory anyone can game the system by just searching in Google, getting the article they want, going back to Google, and so forth. We had our own very powerful search tool, but if a user wanted to use it, they’d have to log in. Also, if the page is cached, then a user can just view the cache to see the article. The other part of first click free is that by allowing Googlebot to see the content based only on a User-Agent lookup, anyone can use any number of tools to pretend they are Googlebot and view all of your content.

  38. Is a page (IP) cloaked? Use Google Analytics! You will see what Googlebot sees!

  39. If I were google I’d take this approach.

    1. Flag any site that doesn’t allow caching as suspicious, and internally cache the result.
    2. Send a ‘userbot’ (my term) that doesn’t identify itself as a bot and comes from an unknown IP pool, and internally cache the results of the latter crawl. (Geez, I get these all the time from China and southern Asia, so it must be possible.)
    3. Immediately send bot 1 back, and check whether the content still differs.
    4. Ban site.

    Doc

  40. If you want to see cloaking, just look at the pay-per-view online newspapers (or the online versions of paper newspapers):

    Google gets a snippet, and when you click on the link (or even on the cache) you get redirected to a page where you have to subscribe to go further (i.e. pay to read the article). Sometimes there is not even a cache, only a snippet of the article (in the Google description) leading you to a registration page.

    To me, it is the most annoying kind of cloaking and these results don’t deserve to be in Google. It is wasting my time when I look for information.

  41. “Nick – I think the original Nick here.” and “dockarl”:

    I don’t think it would be fair to simply spider a page using two different bots and see if the content returned to each differs because some websites have page elements that change dynamically on each visit. (Take a simple server-side hit counter, for example, or a “Random Tip” feature.)

    If Google actually took an approach like this, they’d probably have to consider what % of the page content being served to each bot was different and maybe even request each page twice (or more) using the same bot to see if the page content usually differs… and even that wouldn’t be accurate all the time.

  42. Hi Tony – yup – granted not all sites remain exactly the same between crawls because of these minor elements.

    What I’m talking about are sites like WMW and the NY Times that do (or have) entirely cloaked content to users but not to bots, or vice versa.

    A great example would be if I had a site that presented content to all bots suggesting it was a resource page for a topic such as ‘search engine cloaking’, when in actuality it was presenting ‘Hot Russian Brides’ to human visitors.

    We’re talking big diffs here, not fringe changes.

    Doc

  43. Matt, we use cloaking (or so I’m told by our hosting service) because our main site

    http://www.armchair-travel.com

    has the word “travel” in it, and some of our Virtual Tour customers have firewalls that block any URL with the word “travel” in it…

    So we have an alias of our website

    http://www.armchairvr.com

    which is the same data but served under the “vr” rather than “-travel”.

    We are not concerned with links into the “armchairvr” site, or listings for it in search engines. It is just a convenience for some of our clients.

    I would guess that it is possible that someone would link to the “armchairvr” site without our knowledge (not likely, but possible).

    How would you view that?

    Thanks
    William

  44. Hello,

    Just a comment concerning users who need to check whether a page is cloaked.
    Usually, you have to check the page in Google’s cache.

    But how can you do that if the page uses the noarchive function?

    A method that you may already know consists of IP spoofing of GoogleBot.

    To reach this IP, you have to use Google Analytics, then select the eye-tracking function of Analytics (number of clicks for each area on the page).
    First, you must create a new profile for the URL of the website.
    Here is an example of a cloaked page:
    http://arnoweb.free.fr/concours_crack_cloaking.jpg

    Very interesting for easily identifying webmasters who practice cloaking, without needing technical knowledge.

  45. lol

    This thread is getting to the heart of the never-ending debate of “What is cloaking?”

    Okay; obviously, I do “not” agree that WMW is cloaking, as per the definition of cloaking as I see it. Cloaking IS search engine spam. WMW is not cloaking. Period. If you are detecting Googlebot “solely” and intentionally in order to serve Googlebot one page, and to serve a regular search engine user another page, then yes, that is cloaking and that is search engine spam. All WMW is doing is detecting ALL spiders it does not want to allow access to and denying them. Along with that, they are detecting a “signed-in” member as well and giving them the URL. Along with that, they are detecting a real user who is “not” subscribed (cookies) and giving them the registration page.

    NONE of this is “cloaking”. If you truly use the real definition of cloaking as being the detection of a search engine spider (googlebot) and giving that spider a different page than a user sees, then you are cloaking and you are implementing search engine spam.

    In my mind, cloaking is “always” spam. It cannot be anything else. If this industry does not want to be totally transparent, then expand the definition of cloaking to other areas as well. What you will achieve is claiming that any site that detects the user agent by country and sends them to a country-specific page would also be cloaking. I could go on and on about this. If my site is detecting whether or not the user agent has Flash installed, and sending that agent to the appropriate page… am I now cloaking as well? NOT. Period.

    WMW is Not cloaking.

  46. Further clarification;

    I do not like the fact that I click on a WMW link in Google serps and get a sign-in page. No doubt about it,.. I hate it. But the difference is; I understand the “why” behind it. It totally has nothing to do with “cloaking” as cloaking is always search engine spam. I truly hate the idea of calling anything and everything cloaking. Cloaking is detecting googlebot “only” or yahoo and MSN, and sending those bots to a different page rather than what “sally” sees in the google serps. That is cloaking. Lumping together ALL forms of content delivery and calling it cloaking is not in the best interest of our industry. Not at all.

  47. Maybe a fix to this could be that Google puts a disclaimer on pages that require a sign-in? Maybe a meta tag that google would recognize could trigger this little disclaimer “beside” the serp listing? Something like meta name=subscribe content=requires sign-in. This could trigger the disclaimer on the serps. I don’t know what the answer is, and I understand the big problem. But I also don’t want everything to be called cloaking when it clearly is not.

    My forums have a private area. Right now I am showing the sign-in (have to subscribe) page to anything and everything that accesses it. Only those with the correct cookie get to see the thread. But if I wanted to “detect” ANY user agent that is not subscribed and give them the sign-in page, but allow ANY user agent to see the thread that IS signed in, am I cloaking? Nope.

    A simple little disclaimer beside the Google listing would go a long way toward clearing up the confusion. The search engine user could clearly see they have to subscribe before clicking the link. But it gives the user the option of clicking… or not. It also would serve to give all pages access to the Google SERPs without confusion. Just because a page is a subscription page should not be a reason to deny that page a listing in the SERPs. There has to be a way that all parties can live with.

  48. I do not like the fact that I click on a WMW link in Google serps and get a sign-in page. No doubt about it,.. I hate it.

    That’s it! A useless “user experience”. 🙂

    I don’t have a “paid” membership to WMW and have always been bothered by any traces of it in the SERPs. People also link to it from their blogs; you go there and you can’t even sign in unless you pay. Lame!

    I would weigh whether this “claim” about bots is true or not; something doesn’t smell right.

  49. Doug, that’s your definition; I think most people would call what WMW was doing cloaking. As Aaron says, it makes for a useless user experience.

    Brett can justify it any way he wants, the fact is no other webmaster forum that I know of needs to do these things. WMW is neither the largest, nor the most popular, and content scraping is not a problem unique to them.

    There are numerous ways to combat content scraping that do not involve cloaking. Firstly, many scrapers do not spoof their user agent, so your first step is banning all said user agents. Secondly, many do spoof their user agent but do so in a nonstandard way (an IE user agent in a format no actual IE installation uses, for instance). Detect and ban that. Thirdly, once you do detect bad behavior, ban the IP (like with a brute-force SSH login firewall). You can set up a bot trap with a link that is blocked with meta robots and robots.txt and ban any IP that accesses it. Banning entire abusive countries, or IP blocks that are used by server companies rather than individual IPs, is another good step as well. Finally, there are server modifications for Apache that can throttle users who request too many repeated pages.

    No actual user is going to request 1000+ pages a day, but on a large site bots certainly will. Once a user reaches that level, give them extra attention: compare their IP with the IPs of known search engine spiders. If it doesn’t match, ban them. So they were only banned once they reached 1000 pages — that’s less than 1/1000th of your entire site, big deal. Obviously, to implement his cloaking system Brett already has an IP comparison system; he just would have to switch how it is used.
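
    (A rough sketch of that page-view-threshold idea; the limit, the spider whitelist, and the ban mechanism are all stand-ins, and a real implementation would verify spider IPs against published ranges or reverse DNS rather than a hard-coded list:)

    ```typescript
    // Count page views per IP and ban any IP that exceeds a daily limit,
    // unless it belongs to a known search-engine spider.
    const DAILY_PAGE_LIMIT = 1000;
    const KNOWN_SPIDER_IPS = new Set<string>(["66.249.66.1"]); // placeholder

    const pageViews = new Map<string, number>();
    const bannedIps = new Set<string>();

    function recordRequest(ip: string): void {
      if (bannedIps.has(ip) || KNOWN_SPIDER_IPS.has(ip)) return;

      const count = (pageViews.get(ip) ?? 0) + 1;
      pageViews.set(ip, count);

      // Past the threshold and not a recognized spider: treat it as a bot.
      if (count > DAILY_PAGE_LIMIT) {
        bannedIps.add(ip);
      }
    }
    ```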

    Of course if the bot activity is slowing down your site to a bad degree, then it is probably time to upgrade your hardware or software. Install a caching system.

    Finally, realize that if people are ripping off your content, you can fight back. Send DMCA requests out to their hosting company if they’re in the US and to Google. It works. If they’re some piddly website hosted in India, just try to ignore them. There is no way they’ll ever accumulate enough links to rank well for anything.

    More or less every site I’ve ever published has been ripped off and republished at least once. It hasn’t driven me to cloak.

    The fact is, no matter what Brett says his reasoning is, WMW benefits from its cloaking by increasing user registration rates while gaming the SERPs. It’s wrong that Google should allow it, and I think Matt’s recent investigations were overdue.

  50. Oh yeah, Doug, your examples present poor arguments. Private forums aren’t cloaking, of course; no one has said that or would say that. The issue isn’t whether a site requires registration or subscription, but rather whether it tricks visitors into visiting by showing Googlebot special content different from what a first-time user sees.

    The solution for this is not a special icon or message in the SERPs, but that if you want your site to be registration or subscription only, you need to accept the fact that your content is not going to be indexed. You cannot have your cake and eat it too.

  51. Good. This is a problem that has bothered me for a long time. Here is one post from 2005 I made: http://management.curiouscatblog.net/2005/01/30/web-search-improvements/

    I would think Google could come up with a user participation tool that could help identify this type of behavior. Just give people the ability to click something saying the search link didn’t return the expected results or something (I am sure Google can work on the details). I am sure Google can figure out how to separate, from that feedback, the problematic web sites from all the mistaken, malicious… clicks. Then those few sites, site sections and pages that systemically appear could be forwarded for human examination. This really doesn’t seem like it would be very tricky.

    I suppose Google might worry about the confusion such an attempt might cause regular users. OK, just make it an option. Then users like me, who use Google a lot and get annoyed at having to ignore the sites they have learned won’t provide the content Google says they will, can choose to participate. Yes, I understand this will create a self-selected population identifying problems, but I really don’t see that as a problem.

    Another option I mentioned in 2005 is to give me a way of adding a list of websites I don’t want returned. Actually, now I can just use a Google CSE to do this. In fact, I just did – http://www.google.com/coop/cse?cx=002278424765197393586%3Alfichkez90e&hl=en – I excluded webmasterworld.com to test it out, and it does work. Less than 2 minutes to do. This doesn’t do anything for those who think it is unfair for some to have Google return cloaked pages as results, but it does help if you just want to exclude certain sites from your results.

  52. > Firstly many scrapers do not spoof their user agent,
    > so your first step is banning all said user agents

    Done. Aside from the standard 20-25 bot names, the following are required to log in.

    java , snoop, lwp, php, boitho.com, spiderman, downloader, Missigua, HTTrack, Fetch API Request, webstripper, Jakarta, IEAutoDiscovery, Zehuti, bot.mainseek.net, TMCrawler, megite, Bottino, QihooBot, Hatena, Missigua, Bookdog, NewsGator, API Request, Macan, Python, webmon, yoono, Boston, Jakarta

    That entire lot is rarely seen and causes little problem, and we are probably going to delete it as totally ineffective anymore. Less than 1 in 500 bots uses an actual spider agent name.

    > Secondly many do spoof their useragent but do so in
    > an unstandard way (an IE useragent in a format no
    > actual IE installation uses for instance).

    There are over 200 known legit variations on IE names and no known way of determining which is a bot and which is not. We do account for some that are known.

    > Thirdly once you do detect bad behavior,
    > ban the IP (like with a brute force SSH login firewall).

    IP banning is of limited use in a dynamic-IP world. This is especially true where many top webmasters live in high-density places like San Jose or Mountain View. One of the biggest problem areas has been cable modem users in the Valley. It is a hotbed of search engine activity, and it’s not unusual for us to ban an IP and get a complaint letter from a search engine worker the same day, e.g., because the IP was recycled by the ISP. This exact scenario happened with AdWords Advisor, who had the same IP on the exact same day we banned the IP for a 20k-an-hour bot run. You simply can’t go wholesale banning IPs without sooner or later having a problem.

    > Banning entire abusive countries or IP blocks
    > that are used by server companies and not IPs
    > is another good step as well.

    Not really. We did that in December and again last month with .de. That in turn led to some blog stories turning up from Philipp. You can’t mass-ban entire countries in a vacuum. You can require them to support cookies – and if not, to log in.

    > No actual user is going to request 1000+ pages a day

    We have more than 500 users that visit more than 500 pages a day – We have over 300 users that visit more than 1000 pages a day. We have another 50 that visit more than 2k a day, several moderators hit 3-5k a day, and it is not unusual for me or admins to have over 10k requests a day. There is no way to determine which is a bot and which is not.

    > If it doesn’t match, ban them.

    Nope – you would ban half of rr.com if you did that.

    > Of course if the bot activity is slowing down your site to a bad degree

    If left unchecked, it would be 10 to 1 bots to humans. The site is there for the humans – not the bots.

    > Send DMCA requests out to their hosting
    > company if they’re in the US and to Google.

    It would take a full-time legal person to do that right. How are you going to afford that on a subscription basis? The only thing that would do is get you a whole lot of negative press and blog stories.

  53. I may be the only one who thinks this, but I really don’t think the whole WMW thing is that big of a problem, especially with what’s coming down the pipe.

    *** STRICTLY SPECULATIVE COMMENTARY TO FOLLOW HERE ***

    (Translation: don’t crawl up my ass if I’m wrong. This is just me predicting stuff.)

    With personalization of results, it’s not inconceivable to assume that users will be able to tailor the results such that certain domains/sites do (or more importantly in this case, don’t) show up. You don’t want to see WMW at the top of a SERP for something? Filter out the domain. Since most people who would be looking for WMW-related stuff are web/SEO/tech-geeks anyway, this won’t be that difficult.

    Besides that, most of us who have even remotely followed the WMW situation are aware that we will be required to register for the site. I can’t speak for anyone else, but if I see WMW on the first page of some ultra-geeky SERP I created, the red flag goes up in my head and I don’t click.

    Don’t get me wrong: I won’t use anything that requires registration before viewing even the most basic of content. But it’s their site to do as they wish with, and they’re not doing anything wrong from a strictly SEO standpoint as far as I’m concerned. It’s a bad user experience, yes, but big G didn’t create it…WMW did. Let them fix the problem if they want to, or leave it alone if they choose to.

    As far as the inconsistency of results goes (with some content showing and some not), I haven’t seen that personally, but if it is going on, again, that’s WMW’s problem.

    We can all still vote with our keyboards and mice and go somewhere else, so why are we even talking about this? Let’s move on to a real issue, like unoriginal-content MFA sites or something.

  54. Brett, if you really have legit people viewing the site 1000+ times per day — and I really doubt it (strictly speaking, there are only 24 hours in a day; to read 1000 pages of content would be difficult even if it were your full-time job) — you can always increase the limit: 5,000, 10,000. It would be a simple matter to exclude registered/logged-in people and moderators just in case you did have a user who viewed so many pages. With millions of posts, even at a limit as high as 10,000 they’d still only get a fraction of your site.

    Also, I didn’t mean you should ban western countries. I get tons of spam from .nl and .de, but I don’t ban them. But Russia, Korea, Ukraine, China, Nigeria, etc.? Definitely. You will of course undoubtedly end up banning some real users, but until these countries get real about policing their Internet usage, it’s a valid tactic to combat not only site ripping but all forms of spam and hacking attempts.

    Of course the main reason I doubt you’re honestly doing this for the bots is the fact that you don’t allow Google to cache your content. If you allowed them to cache it then, when presented with a registration box instead of the content alluded to in the SERPs, the user could simply view the cache. You don’t allow this, and it can’t be because of bots because…

    1. Viewing Google cache causes no resource drain on your server.
    2. If a person is smart enough to crack Google’s cache ID algorithm, and to figure out a way to get a total list of all your URLs to query the Google cache with, then they’re smart enough to write a user-agent bot that can handle cookies and so both log in and rip your site. That’s assuming Google doesn’t have any backend security to prevent cache abuse (and I’m sure they do).

    You can try to justify the cloaking with a bot excuse, but how do you justify the nocache then? I think it’s obvious that that method is without a doubt aimed at users, and not bots.

  55. Chris, I am all for finding solutions to the issue, and appreciate the discourse.

    > It would be a simple matter to exclude
    > registered/logged in people and moderators

    Which is what we do, Chris. Those aren’t an issue. The issue is the one-offs that come in during a Google update and view a thread 20 times a minute for an hour, waiting for a new message to pop up to see what other people said. How do you account for that? The only way we can is by requiring a login from some that look like bots. That in turn leads us back to the same issue all over again. I am not going to deny them content just because they are overeager – that is exactly who the site is there for, and remaining available is more important than anything.

    Say you generate 10 page views a minute for 10 minutes (very common), and an admin sees it and requires your IP to log in. How long do you leave that on? Do you do it for a day – a week – a month – a year – or permanently? We’ve been doing it close to permanently for the last year. It is hard to go back and keep track of which were the most troublesome IPs.

    I think people don’t appreciate, or forget, who our typical member is that visits the site. Webmasters are tech-savvy, offline-browser knowledgeable, independent, fearless, and have a touch of an attitude. Additionally, we are the only major forum with a flat, easily crawlable structure that can be downloaded by every offline browser listed on Tucows. Very few of those same offline browsers can download a forum with CGI-based URLs.

    > Viewing Google cache causes no resource drain on your server.

    It is a liability that can’t be exposed. DMCA actions, takedown notices, copied-in content, flames or trouble posts, and mass wholesale site theft out of the cache – it’s trouble waiting to happen, and easily avoided by not exposing your content to it.

    > cookies

    I have tested that and am reconsidering just requiring pure cookies – which would most certainly lead us back to the same issue of “what is a bot” or “what is a human.” If you did do a pure cookies-required setup, you’d absolutely have to pure-IP cloak for the engines, or you’d ban them too. So it is far from perfect.

    Yes, there are numerous bots running in the forums that support cookies. It is dog easy to write a bot that supports cookies. However, when you require a validated login, it raises the bar and lowers the rate of abuse. You can also swap out cookies from time to time and re-require a login at a threshold. Which we also do.

    > But Russia, Korea, Ukraine, China, Nigeria, etc. Definitely.

    We can’t do that on a server level, because we have legitimate members – many subscribers – from those countries. That’s who the site is there for.

    What if we started to publish the log entries, the IP’s of the bots, and also the letters of complaint to the ISP’s? See here: http://www.webmasterworld.com/search_engine_spiders/3233952.htm

    Also, see this article by Google’s Vice President & Chief Internet Evangelist Vint Cerf, “Botnet ‘pandemic’ threatens internet”
    http://news.bbc.co.uk/1/hi/business/6298641.stm

  56. Chris; It’s not cloaking. You are using your definition. I’ll use mine. Mine is certainly more clear and certainly more user-friendly. Using yours, it would also mean that if I detect users without Flash installed, which also includes Googlebot, and send them to the all-HTML page, then I’m also cloaking, right?

    I do agree with you that maybe people who have subscription-based content should not expect to be in the SERPs. Maybe they shouldn’t expect it, but not allowing it in hurts the user experience as well, as you are not giving the user the “option” to click and register. Having some option is better than no option at all.

    There are other things Brett is doing which make this a difficult problem. There’s more to it than keeping out bad bots. As Adam said, it “is” his problem. I agree that it’s really not Google’s problem. It’s only their problem in that they should put a disclaimer on the page in the SERPs. Making it Google’s problem means saying that Brett is spamming Google, which he clearly is not.

  57. Where is Phil Craven when I need him? 🙂

  58. Matt,

    I really don’t understand why you’re giving any credibility to Philipp Lenssen’s half-cocked opinion about WebmasterWorld cloaking when he obviously doesn’t know what he’s talking about. Philipp should be careful about calling someone a cloaker — and I’m surprised you even jumped on this — when in fact they are NOT CLOAKING; being labeled as such is just as libelous as calling someone a spammer when they aren’t.

    I’ve seen what Brett’s doing and it’s not cloaking whatsoever. It’s called SECURITY, and he’s dealing with some BAD NEIGHBORHOODS on the internet that are involved with scraping. This requires a LOGIN to access his site from some ’net neighborhoods, while others are wide open and see WMW content without limitation.

    FWIW, if I’m not mistaken there are some major sites that allow Google to freely crawl their premium content but require a FREE registration to access that same content.

    How is this cloaking?

    Heck, I put a CAPTCHA up on first access from some REALLY BAD NEIGHBORHOODS as well, just because they’re a mix of humans and bots; it’s the only way to let humans in yet stop the bots dead in the water.

    I don’t expect WebmasterWorld to change their behavior until they can effectively firewall the bots, or until Google can stop indexing scraped content on Made For AdSense sites, which incentivizes them to attack WMW in the first place.

    Overall, I’m really surprised at your attitude about what kind of access-control security a website employs, which is NOT CLOAKING, and that you would run off and post this using such a half-baked, misinformed article as Philipp’s as evidence.

  59. I usually use the SWFObject JavaScript source to embed my SWF files into the HTML document.
    I serve users that don’t have the correct Flash player installed (like Googlebot) a text alternative. Is this considered cloaking?
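
    (For what it’s worth, a common way to do this without serving different HTML to anyone is progressive enhancement: keep the text alternative in the markup by default and only swap in the Flash movie client-side when a player is detected. A rough sketch — the element ID, movie path, and plugin check are all illustrative, and a library like SWFObject handles the detection and embedding more robustly:)

    ```typescript
    // Progressive-enhancement sketch: the text alternative lives in the
    // HTML (so every visitor and every crawler gets it), and the Flash
    // movie replaces it only when a Flash player is actually present.
    // "movie-slot" and "intro.swf" are made-up names.
    function hasFlashPlayer(): boolean {
      return typeof navigator !== "undefined" &&
        Array.from(navigator.plugins).some((p) => p.name.includes("Shockwave Flash"));
    }

    function enhanceWithFlash(): void {
      const slot = document.getElementById("movie-slot");
      if (!slot || !hasFlashPlayer()) return; // leave the text alternative alone

      slot.innerHTML =
        '<object type="application/x-shockwave-flash" data="intro.swf" width="400" height="300"></object>';
    }

    window.addEventListener("load", enhanceWithFlash);
    ```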

  60. Now tell us how you really feel Bill….. 😀

    So you see Chris; many actually have the correct definition of cloaking. BTW: Phil Craven shares the same definition, and many others. The reason? It’s a clear definition.

  61. Cloaking is detecting SE bots via IP, UA, or both, and serving them different content than everyone else.

    Detecting users who do not have flash installed and redirecting them is NOT cloaking because it doesn’t matter if you’re a search engine or not. I could turn off flash in my browser and see the exact same thing Google sees. That is the difference. Of course a better method is to have the normal textual version be the default version and forward/popup a window for people with flash, the opposite method.

    What Brett was doing was detecting based on IP and User Agent and serving specific search engines different content based on those systems.

    It wasn’t a case of a “bad neighborhood” like IncrediBill is saying; it was everyone who wasn’t a search engine.

    Brett, obviously, has made some changes and will no doubt continue to make changes, and I’m not saying he’s cloaking now, after the changes, but he was cloaking before. He might not be doing it anymore, but sometime after November of 2005 — when he said he banned all bots but “wasn’t worried” about losing SERP placements — he reversed course and started cloaking.

    It is a good thing that Google has finally stopped what many viewed as preferential treatment.

    And like I said above, WMW is neither the largest nor the most popular webmaster forum, let alone forum in general. There are thousands of larger and more popular content sites, and I don’t know of a single one that needs, or needed, to cloak to prevent site rippers.

  62. Sorry Doug, your definition of cloaking just isn’t correct.

    When it comes right down to it? Who defines cloaking? Not you, not me, probably the search engines… and I think the search engines have spoken.

    The fact is, does it really matter what your definition is if no one uses it as the basis for whether or not a site gets banned or penalized? You can make up any definition you want, but unless it’s actually used by the people who matter (i.e., Matt Cutts & Google), then what use is it?

    I don’t get why you’re defending a blackhat technique to begin with. Even if you buy Brett’s weak bot argument, there is no altruistic justification for disallowing Google’s cache.

    If you’re serious about long-term SEO, do you really want to support pushing the envelope in regard to blackhat techniques? “It’s okay to spam if you’re doing it for the right reasons.” Is that the example people should see?

    I don’t think so.

  63. >> Viewing Google cache causes no resource drain on your server

    Chris,

    My site is under attack much like Brett’s and I evangelized disabling cache in all search engines at both SES ’06 in San Jose and Chicago and at PubCon ’06.

    One of the main reasons I advocate NOARCHIVE is because Google’s cache is also a scraping target and indexed by MFA (Made For AdSense) sites which then fight for your own position with your own content in Google.

    Additionally, scrapers can bypass spider traps and honey pots installed on your website: instead of obeying robots.txt, they acquire lists of links from the search engine cache, such as grabbing a copy of your SITEMAP and other things.

    So it *IS* because of ‘bots that some of us disable CACHE, and using LOGINS or CAPTCHAS to control access from bad ‘net neighborhoods has nothing to do with cloaking.

  64. >>The fact is, does it really matter what your definition is if no one uses it for the basis of whether or not a site gets banned or penalized? You can make up any definition you want, but unless its actually used by the people who matter (ie Matt Cutts & Google) then what use is it?

    Huh? Where did I say anything of the sort?? I did not say that cloaking is whether or not a site gets banned or penalized. LOL

    What I said is that cloaking is “very” specific. It’s detecting a bot for the SOLE reason to rank in the serps, and sending that bot to a page other than Sally would see. Brett is not doing that.

    >>Cloaking detecting SE bots via IP, UA, or both and serving them different content than everyone else.

    Well yes, but in Brett’s case there are real reasons for doing that. He is also doing many more things than that as he does allow users already subscribed to view the threads. He is simply allowing Google to view the threads as well. It’s not like Google does not know this so it is not deceiving Google whatsoever. That is not the definition of cloaking and never has been.

    I mean, let’s get real here; you know I’d be the very first to say WMW is cloaking if that were indeed the case. Cloaking is “always” search engine spam. WMW is not cloaking. WMW is not doing what they are doing because they want to deceive Google, nor to gain some kind of SERP advantage. I think we know those threads would be up there in the SERPs no matter what was going on. Sheesh, Brett is no dummy.

    >>If you’re serious about long term SEO do you really want to be supporting a pushing of the envelope in regards to black hat techniques?

    I think we all know how serious I am about this industry. I also think we all know I do not support blackhat techniques…. well Duh?? LOL

    We will have to agree to disagree about the definition.

  65. As a matter of fact, us fat boys have long memories, just like elephants, and remember things that Matt posted in March 2006 about needing to sign up in WebmasterWorld:
    http://www.mattcutts.com/blog/how-to-sign-up-for-webmasterworld/

    So Matt, would you care to explain how in exactly 362 days you went from telling people HOW to sign up for WebmasterWorld as you knew they were getting a login page to now claiming they’re CLOAKING based on Philipp Lenssen’s link-bait posts?

    Suddenly this entire thread isn’t passing the sniff test.

  66. Bill; Matt didn’t say WMW was cloaking. Or did I read his post correctly?

    What I disagree with Matt about in his post is his statement that he is finding people cloaking for reasons other than SE spam. Cloaking is spam. If it’s not spam, then it’s not cloaking and is simply another form of content delivery.

  67. >>Huh? Where did I say anything of the sort?? I did not say that cloaking is whether or not a site gets banned or penalized. LOL

    Where did I say it was? I said the definition we should all be using is the definition used by those who have the power to ban or penalize, i.e. the only people who matter.

    >>What I said is that cloaking is “very” specific. It’s detecting a bot for the SOLE reason to rank in the serps, and sending that bot to a page other than Sally would see. Brett is not doing that.

    Yes, he is. He banned all bots in Nov of 2005. You probably remember that, it was discussed in many places. He said at the time he wasn’t worried about losing SERP positions. Then he apparently changed his mind, unbanned bots, and started cloaking instead to regain the SERP positions he lost. That exactly fits your definition.

    But you emphasized the word SOLE, so you must think it’s okay to cloak if you’re trying to get to the top of the SERPs, so long as you’re already trying to do something else with it? Seems a strange definition to me.

    >>I also think we all know I do not support blackhat techniques…. well Duh?? LOL

    I think you’d like to think that, but I also think you’re merely redefining “blackhat” to suit your opinion.

    >> It’s not like Google does not know this so it is not deceiving Google whatsoever. That is not the definition of cloaking and never has been.

    Of course Google knows, hence this post, hence Matt Cutts saying WMW could end up banned if they didn’t change their system. I wouldn’t use the fact that “Google knows” as support for your argument when Matt is saying that it is against their guidelines and will result in punitive action.

    >>One of the main reasons I advocate NOARCHIVE is because Google’s cache is also a scraping target and indexed by MFA (Made For AdSense) sites which then fight for your own position with your own content in Google.

    You mind showing me an example? Just any example of a remotely popular keyword phrase in which a scraper site outranks a legit site with the exact same content?

    Not that it isn’t possible, considering many scraping spammers are better at fundamental SEO than the sites they steal from, but Google has gotten a lot better at fighting that type of spam, and it also takes only a minute to fill out a DMCA request & a Google spam report in the rare case that a spammer does outrank you by stealing your content. Open up your Word template, copy and paste a few URLs, print it, sign it, fax it.

    Granted Google does have to do a lot more, specifically hurt the spammers in the wallet by not allowing such publishers into Adsense. But I think you’re blowing the scraper problem out of proportion. Most scraper sites are crappy, ugly, banned shortly after launch, inspire users to click the back button, and gain no quality incoming links. If you’re having a problem beating them in SERPs on keyphrases you’re

    This all being said, IMO a captcha would be alright. Require people to do a captcha or human check of some sort, and give SEs a free pass. There is no benefit to the webmaster, unlike requiring registration, which gives the webmaster another user, lead, and email to market to.

  68. The reasons are irrelevant, some are for good some are for bad. What you call it doesn’t matter at all.

    The text in the snippet shown in Google’s results should be viewable on the page users are delivered to; that’s it. If the page doesn’t do that, then it shouldn’t be indexed, much less ranked in the results page.

    I think the real lesson here is that there is no automatically detectable way for Google to find cloaking, and it takes 6 months of blog posts on a fairly popular blog to inspire at least a manual review. So for all of you wanting to cloak/redirect/etc.: don’t worry unless someone with a lot of visibility blogs about it, and even then you still have at least half a year to build your mailing list and make the hard sell.

  69. I wish I could find the thread with PhilC talking about this same thing.

    We do agree on the disclaimer though. That’s a good thing so let’s build on that. 🙂

  70. The problem with a meta tag driven disclaimer or something else is that it’d be difficult to impossible to police. I don’t think Google wants to manually review every site in their index.

    Undoubtedly, despite what some people think, Google does have automated cloaking detection systems. If you implemented a system whereby a webmaster indicated they were cloaking for such-and-such okay reason and Google should display a notice in the SERPs, then all a spammer would have to do is issue such a notice as well and be immune to any automated detecting system. In short, it requires complete trust in the honesty of webmasters, and you just can’t trust people like that (see AltaVista circa 1997). This would result in more spam that’d need to be manually reviewed and removed.

    The other option is to have Google manually assign such indicators, but again, this is a manual review system, something I doubt Google would go for, and all the spammer would have to do is wait until after his review to start his nefarious activities.

    The best solution is for webmasters to think of better ways than cloaking to solve their problems.

  71. Thank you for finally addressing this issue about “cloaking” and WMW.

    I have read Brett’s comments on the subject and I can understand why he thinks it’s necessary.

    I propose we look at the real issue — the very (VERY) real issue of rogue bots harvesting any and all content found on the WWW.

    I think Google should not consider this cloaking. Things have changed so much in the last couple of years that it is now normal to have every unimaginable bot on your web site at any given time. It’s mad. Some of these bots are so badly written, they aggressively fetch up to 5 pages a second — for hours! Then there’s the “referrer-spam” bots that are simply out of this world!

    I think Google should reconsider lumping this all together as cloaking. Maybe Google should come up with an acceptable guideline for webmasters who wish to block or prevent such abuses.

    Maybe redirecting to a login page is not acceptable. Maybe what is acceptable is a redirect to a page explaining that the IP is banned for repeated abuse, was used by a rogue bot in the past, etc., how a regular user may view the content (i.e. it requires registration or login), and a link to contact the webmaster for assistance.

    I must admit honestly that I have had to implement something similar recently to limit the abuse on my web site. However, the algorithm I wrote also automatically white-lists traffic from banned IPs, allowing them access to the site if they are found to be real users.
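    (For illustration only, here is a minimal sketch of that kind of gate, written as a small Python/Flask app. The IP prefixes, messages, and “prove you’re human” step are made-up assumptions for the example, not the actual algorithm described above.)

    # Sketch of an access gate for "bad neighborhood" IP ranges with auto-whitelisting.
    # Illustrative only: the prefixes and challenge are placeholders, and the
    # whitelist lives in memory, so it resets whenever the process restarts.
    from flask import Flask, request

    app = Flask(__name__)

    BAD_NEIGHBORHOODS = ("203.0.113.", "198.51.100.")   # example prefixes only
    WHITELIST = set()                                    # IPs that proved they are human

    @app.before_request
    def gate_bad_neighborhoods():
        ip = request.remote_addr or ""
        if ip in WHITELIST or not ip.startswith(BAD_NEIGHBORHOODS):
            return None                                  # normal visitors pass straight through
        if request.path == "/verify":
            return None                                  # let blocked visitors reach the challenge
        # Explain the block instead of silently serving a login or error page.
        return ('This IP range has a history of abuse. '
                '<a href="/verify">Click here</a> to confirm you are a real user.'), 403

    @app.route("/verify", methods=["GET", "POST"])
    def verify():
        # A real CAPTCHA or email check would go here; passing it whitelists the IP.
        if request.method == "POST":
            WHITELIST.add(request.remote_addr)
            return "Thanks, you now have normal access to the site."
        return '<form method="post"><button>I am human</button></form>'

    @app.route("/")
    def index():
        return "Normal site content."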

  72. Bill; Matt didn’t say WMW was cloaking. Or did I read his post correctly?

    Doug, the whole article is mostly about WMW, their possible cloaking, and their being considered for removal from Google over simple site security options. Just because you can access WMW freely from one location and not others does NOT make it cloaking, nor does it justify WMW being the centerpiece of a cloaking article implying they were subject to removal for protecting their intellectual property and maintaining adequate bandwidth and server performance for all members, many free and some paid subscriptions.

    Whether WMW was actually dumped from Google now isn’t the issue as it’s obvious they’re under scrutiny. My problem is with what happens the NEXT TIME Matt clicks a link and either isn’t a) logged into WMW or b) is trying to access WMW from a bad ‘net neighborhood.

    Does WMW get the boot from such a simple misunderstanding?

    Will I get the boot for doing almost the identical thing with my site?

    How does Google have the right to tell webmasters not to protect their content from the very sites stealing and using it against us?

    Whether scrapers take ONE PAGE, the one indexed and displayed in Google SERPs, TEN PAGES, or a THOUSAND PAGES, the damage can be just as bad regardless of volume. With the imposed rule from Google that you must be able to see at least the ONE page indexed from the SERP, a scraper using a slew of anon proxy servers can theoretically rip off an entire site, since Google has now forced the front door to be opened in bad neighborhoods because of misguided whining.

    What happens when the FIRST PAGE is freely accessible but the same person tries to access the SECOND page from WMW shown in yet another Google SERP and gets met with a login request? Are they cloaking now if it’s a one-page limit for free?

    OK, so what happens next?

    Does WMW have to allow everyone free access to all pages as long as there’s a referral from Google’s SERPS when accessing the page?

    If you said yes, then it’s the WRONG ANSWER, because then you’re back to being full-on scraped and attacked: the scrapers can simply stuff faked Google referrers in their requests and it’s off to the races to stop them once again.

    Even this old post has many people screaming that WMW is cloaking when it’s nothing but simple site security aimed at certain parts of the internet causing trouble.
    http://www.mattcutts.com/blog/how-to-sign-up-for-webmasterworld/

    – I didn’t have the cycles to deal with it right then, but earlier this year I made it clear that WMW would be removed if it met the definition of cloaking when I tested it.

    OK, please take some time and narrowly DEFINE CLOAKING!

    If WMW did a bait and switch, showing you one type of content in the SERP and then displaying another type of content when you access the page, which is what scrapers and Made For AdSense sites do, I’d agree this is cloaking.

    However, WMW simply asks SOME people for a LOGIN, which is not content whatsoever; it’s a SECURITY measure. Accessing the page indexed in Google is still FREE, and if they don’t feel like becoming a member then it’s fine for them to walk away and look elsewhere.

    When your policy on cloaking potentially interferes with the ability of a service like WMW to actually keep their servers online, then you need to review your policy, as a LOGIN page or CAPTCHA used to stop abuse is NOT CLOAKING!

    To the others who argue you should be able to surf anonymously: then go somewhere other than WMW. Trying to tell WMW how to run their business without being able to protect themselves from rogue ‘bots basically puts WMW up against the wall, because if they bow to pressure to remove the login from certain areas the ‘bots can knock them offline.

    You can’t surf the site anonymously or otherwise when it’s knocked offline.

  73. Interesting debate happening here.

  74. …you went from telling people HOW to sign up for WebmasterWorld as you knew they were getting a login page to now claiming they’re CLOAKING based on Philipp Lenssen’s link-bait posts?

    Sorry Matt, I was writing hurriedly and wrote the above wrong. Matt didn’t claim WMW was cloaking, Philipp did. However, Matt DID entertain the idea and checked it out with the possibility of removing WMW, even though Matt posted a year ago regarding the LOGIN page and this was nothing new.

    Chris Said,

    The best solution is for webmasters to think of better ways than cloaking to solve their problems.

    I think I’ll just do what I’ve been doing until this is officially labeled as cloaking because it’s NOT, it’s a security access method for bad ‘net hoods. Until you run a site getting upwards of 1M visitors a month you have NO CLUE what’s involved in keeping a server under that type of load online when the bad boys come knocking.

    Your cloak is my door man.

  75. I do run sites with 7 figure uniques a month. And I have for years. I’ve had bots slow down my servers, I’ve had site rippers even crash them. I’ve adapted, improved hardware, banned bad ips, and created caching systems. I’ve never cloaked.

  76. I use 3 different ISPs to access the net, depending on where I am at the time.

    On two of them, WMW asks me to sign in if I arrive there from a SERP, or from a link on another site. On one ISP it does not. On one of the ISPs that does ask me to sign in, I know that thousands of non web-savvy users have inadvertently let a myriad of trojans and bots into their systems.

    Showing a log-in page to users is NOT cloaking. Showing a keyword-filled page to bots and some other content to humans IS cloaking. A log-in page is NOT content, it is a log-in page.

    Brett shows to bots the same page of content that a logged-in user sees. That can’t be cloaking. The bot sees the actual content. If you are not logged-in then you are asked to do so if you look like a bot, or come from an ISP that has a history of abuse.

    Cloaking would be where a keyword loaded page is shown to the bot, and some other content page would be shown to the human. That isn’t what is happening here.

    *** I don’t have a “paid” membership to WMW and have always been bothered by any traces of it in the serps. People also link to it from their blogs, you go there and you can’t even signin unless you pay, lame! ***

    You don’t HAVE to pay to join WMW. All but a couple of sub-forums are available in the free membership class.

    *** I think most people would call what WMW was doing cloaking. ***

    I’ll add my name to the “not cloaking” list.

    As for numbers, I know I can easily get through several hundred “pages” in a day on WMW. Logging in, navigating to and then reading the index page, then the forum thread list, 3 pages of a thread, writing a reply, then editing it, and going back to the thread list already accounts for at least 15 “page” views.

  77. Let’s examine the Google Webmaster Guidelines and HELP for definitions of cloaking and see if there is any reference to a LOGIN or other kinds of web site security in relationship to cloaking.

    Quality guidelines – basic principles

    Make pages for users, not for search engines. Don’t deceive your users or present different content to search engines than you display to users, which is commonly referred to as “cloaking.”

    Nothing deceptive on WMW as users and search engines see the same content. Some users see the page after a LOGIN, so it’s not fooling the search engine, it’s authentication of a human, no rule breaking here.

    Why does Google remove sites from the Google index?

    However, certain actions such as cloaking, writing text in such a way that it can be seen by search engines but not by users, or setting up pages/links with the sole purpose of fooling search engines may result in removal from our index.

    OK, anyone fooled here?

    Many big sites require a login to access their content, and WMW only requires SOME people to login that are in bad ‘net hoods, so LOGIN and you see what’s in the SERP, no fooling!

    The summary page says that my site is currently not indexed due to violations to the webmaster guidelines. What does this mean?
    However, certain actions such as cloaking, writing text in such a way that it can be seen by search engines but not by users, or setting up pages/links with the sole purpose of fooling search engines may result in permanent removal from our index.

    Again, there is absolutely no mention that a security method is disallowed, just that you shouldn’t write to FOOL search engines which again is not the case on WMW.

    How can I create a Google-friendly site?
    Don’t fill your page with lists of keywords, attempt to “cloak” pages, or put up “crawler only” pages. If your site contains pages, links, or text that you don’t intend visitors to see, Google considers those links and pages deceptive and may ignore your site.

    Yet again, there’s nothing being shown to the search engines that isn’t intended for the user to see.

    As a matter of fact, nowhere could I find any reference in the Webmaster Help Center on Google regarding sites that employ security measures or a login, other than an explanation of how a 401 (Unauthorized) error might prevent a page from being indexed.

    I think this issue needs to be officially addressed on the Webmaster Help Center before anyone using security to protect their servers gets unduly penalized as cloaking when it’s simply not true.

  78. >>Cloaking would be where a keyword loaded page is shown to the bot, and some other content page would be shown to the human. That isn’t what is happening here.

    Uhh… yes it is. The threads shown to the SEs are full of keywords. The login page is not.

    It isn’t okay just because user registration is free. People are required to sign up, provide an email address, etc. Do you think there is no value in having more registered users and more emails to market to?

    It’s fine if you want to require registration in your forum, but it is not fine to trick users into visiting there. That is the bottom line: it is deceptive. The content represented in the SERPs is not the content you see when you follow the link.

  79. JLH has nailed my concern on the head when he said …

    “The reasons are irrelevant, some are for good some are for bad. What you call it doesn’t matter at all.

    The text in the snippet shown in Google’s results should be viewable on the page users are delivered to; that’s it. If the page doesn’t do that, then it shouldn’t be indexed, much less ranked in the results page.”

    While on this forum I am sure many technical needs and theories will come up and be discussed, but what is important to the general user of Google is…

    “Do I find what I want when I do a Google search?”

    The answer is no when someone using Google finds a page that seems to be exactly what they want in a Google snippet and is then directed to another page, whether a registration page or any other page. The thing is that Google is not delivering; if a site wants to redirect Google users to a specific page, then I feel the only page that Google should index is the page that the users are getting redirected to anyway.

  80. People are required to sign up, provide an email address, etc.

    Interstitial authentication of human vs bot is not deceptive and people can go elsewhere if they want anonymity.

    Besides, you see anyone holding a gun to their head to fill in that form?

    If you want the information you may have no choice but to fill in that form or simply look elsewhere. I find links in blog posts to big media sites all the time that ask for me to login for free to gain access to the article.

    The only job of the search engine is to make me aware of the content in the first place. The choice is MINE, not the search engine’s, to determine whether or not to complete the interstitial authentication process or just go away.

  81. *** *** Cloaking would be where a keyword loaded page is shown to the bot, and some other content page would be shown to the human. That isn’t what is happening here. *** ***

    *** Uhh… yes it is. The threads shown to the SEs are full of keywords. The login page is not. ***

    It is interesting that you chose to only quote part of what I said.

    Let me get some facts straight.

    A log-in page is NOT content.

    The bot gets the same content that a logged-in user gets. That is NOT cloaking.

    Cloaking would be where a logged-in user saw the thread, but the bot was fed a page stuffed full of keyword lists, a page that a user could NEVER see.

    In this case, users CAN see the same page that the bot sees – IF they log in.

    There are two types of page here:

    1. Content that the bot gets, and content that a logged-in human gets. These are identical. There is no cloaking.

    2. A log-in page. You get this if you appear to be an untrusted bot, or a user coming from an untrustworthy ISP.

  82. Well, obviously the people that matter disagree with you. It is clearly against SE guidelines, other sites have gotten banned or penalized for the same thing. Defending it probably isn’t going to work no matter your zeal. Remember, Google’s definition (hint, they matter) says that the content seen when clicking on the link in the SERP should be the same as what Googlebot sees.

    I am curious to know… if the subscription wasn’t free, if you had to pay, would it still not be cloaking?

    What if you had to sign up for an incentivized affiliate offer to activate your subscription? Still not cloaking?

    How many hoops would a user have to jump through to get the content (content that they thought was free and available just by clicking) before you would consider it cloaking? Where are you going to draw the line?

  83. Matt – you started quite the debate here…

  84. It is clearly against SE guidelines

    What part of my post above did you miss, where I showed that interstitial authentication is never mentioned in reference to cloaking?

    Not only is it clearly NOT against the SE guidelines, LOGIN is never mentioned.

  85. One thing I’m very curious about Matt (and since you started the thread, you opened the door for it), just what is Google’s official stance regarding sites like the Wall Street Journal?

    Near as I can figure from what I see, Google apparently thinks it’s perfectly okay to return WSJ content in the SERPs, then have the WSJ redirect folks to a “Hey, you gotta’ pay to see this” page.

    Just color me curious 😉

  86. I went back to a post I made on Search Engine Watch almost 2 years ago on this subject and Google’s “First Click Free” program. The following is a snippet of the program details (I don’t know if the program is still in effect, as I am no longer with the company where I was offered the program and I haven’t heard anything of it since, but this is right in line with the WMW situation):

    If you offer subscription-based access to your website content, or if users must register to access your content, then search engines cannot access some of your site’s most relevant, valuable content.

    Implementing Google’s First Click Free (FCF) for your content allows you to include your premium content in Google’s search index. First Click Free has two main goals:

    1. Including highly relevant, premium content to Google’s search index provides a better experience for Google users who may not have known that content existed.

    2. Promoting sales of or subscriptions to premium content for Google partners.

    To implement FCF, you need to allow all users who find your page using Google search to see the full text of the document that the user found in Google’s search results, even if they have not registered or subscribed to see that content. Thus, the user’s first click to your premium content area is free. However, you can block the user with a login or payment request when he tries to click away from that page to another section of your premium content site.

    Thus, FCF is designed to protect your content while allowing for its inclusion in Google’s search index.
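    (To make the mechanics concrete, here is a minimal sketch of the first-click-free idea as a Python/Flask handler. The cookie name, referrer test, and URLs are assumptions for illustration; this is not Google’s or any publisher’s actual implementation.)

    # Sketch of a "first click free" gate: visitors arriving from a Google results
    # page get the full article they clicked; further clicks hit the login wall.
    from urllib.parse import urlparse
    from flask import Flask, request, redirect

    app = Flask(__name__)
    ARTICLES = {"42": "Full text of premium article 42..."}   # placeholder content

    def came_from_google_serp(req) -> bool:
        # Naive referrer check for illustration; referrers can be faked or absent.
        host = urlparse(req.referrer or "").netloc
        return host == "www.google.com" or host.endswith(".google.com")

    @app.route("/article/<article_id>")
    def article(article_id):
        if request.cookies.get("subscriber") == "yes":
            return ARTICLES.get(article_id, "Not found")      # subscribers see everything
        if came_from_google_serp(request):
            return ARTICLES.get(article_id, "Not found")      # the first click from a SERP is free
        return redirect("/login?next=" + request.path)        # everyone else gets the login wall

    (As noted elsewhere in this thread, a referrer check like this is trivially spoofed, which is part of the tension between first-click-free access and bot control.)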

  87. *** Well, obviously the people that matter disagree with you. ***

    LOL.

    Too Funny.

    You’re still wrong.

  88. Now you guys see what you did? You see? You woke up Bill! Looks like he’s been drinkin’ that red Kool-Aid he calls beer and he’s on a sugar high, too. 😀

    I still haven’t seen one question answered in all of this, so I’ll reask the question using a different tack:

    If any of us were to visit a page from a SERP that did not contain the information we sought, or if we were in some way not 100% satisfied with the site containing the page, we would hit the back button on our browser and try a different page from the same SERP, or maybe a different search engine, or maybe we’d go do something crazy like use the phone book or something. In other words, we’d find another avenue.

    Why can’t those of you who are complaining do the same thing for subscription-based content, if you feel so strongly about it?

    Look at it this way: if you do, Bill won’t be angry about this issue any more and he’ll be down to 999,999 more things to be legitimately pissed off about. 😀

  89. Firstly, many scrapers do not spoof their user agent, so your first step is banning all said user agents. Secondly, many do spoof their user agent but do so in a nonstandard way (an IE user agent in a format no actual IE installation uses, for instance).

    BTW, just thought I’d point out that this is patently false and I have almost 12 months of scraper data to back up my claims which you’re more than welcome to see any time. The most offensive scrapers just cut ‘n paste actual user agents from log files, real user agents.

    As a matter of fact, some site downloaders come with IE 6 as the default UA, but thanks for playing.
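    (For what it’s worth, the user-agent sanity check described in the quoted claim could look something like the sketch below, in Python with purely illustrative patterns. As the reply above points out, it only catches sloppy scrapers; anything that copies a genuine user agent verbatim sails right through.)

    # Sketch of a crude user-agent format check. The regex and bot names are
    # illustrative; real crawler verification should use reverse DNS, not strings.
    import re

    KNOWN_BOTS = ("Googlebot", "Slurp", "msnbot")

    def looks_suspicious(user_agent: str) -> bool:
        if any(bot in user_agent for bot in KNOWN_BOTS):
            return False
        claims_msie = "MSIE" in user_agent
        standard_msie = re.search(r"Mozilla/\d\.\d \(compatible; MSIE \d", user_agent)
        # Claims to be IE, but not in the usual "Mozilla/x.x (compatible; MSIE x..." form.
        return claims_msie and standard_msie is None

    print(looks_suspicious("MSIE 6.0"))                                            # True
    print(looks_suspicious("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))  # False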

  90. Wow Matt. How do you get any work done? I got 1/10th of the way down and realized I have to stop reading all the comments =P

  91. g1smd wrote:
    >>>The bot gets the same content that a logged-in user gets. That is NOT cloaking.
    Cloaking would be where a logged-in user saw the thread, but the bot was fed a page stuffed full of keyword lists, a page that a user could NEVER see.
    In this case, users CAN see the same page that the bot sees – IF they log in.

    Exactly.

    Gee Chris; I’m finding that there are many, many in our industry who seem to have the same definition of cloaking as myself. Not only quite a few in this thread, but quite a few others out there who have not written anything yet. 🙂

    Yes Bill; I didn’t think Matt claimed WMW was cloaking, though I know the article kind of reads as if it’s claiming as much. I also believe the Google definition of cloaking is exactly my definition as well. They certainly wouldn’t want to “expand” the definition to include all forms of content delivery.

    Cloaking is always search engine spam.

    The answer to the member’s question about WSJ is that it is “not” cloaking as it’s basically the same thing. They are showing all user agents who are “not” subscribed the sign-in page. They are showing all bots and users who are subscribed the content page. Not cloaking.

  92. any cloaking was probably designed with an intention to register users

    That might be an added benefit, but WMW was being knocked offline by abuse and it was designed to keep the servers up.

  93. >BTW, just thought I’d point out that this is patently false and I have almost 12 months of scraper data to back up my claims which you’re more than welcome to see any time. The most offensive scrapers just cut ‘n paste actual user agents from log files, real user agents.

    Bill, do you ever NOT speak in absolutes? I never said all, nor 100%.

    >Why can’t those of you who are complaining do the same thing for subscription-based content, if you feel so strongly about it?

    I don’t care if WMW requires registration; I don’t use the site, as I prefer fact-based discussions. It’s just that all of Tabke’s yes-men here were coming down on Matt & Google for finally addressing this issue. I obviously support Google’s efforts to combat spam. I do run Google-Watch-Watch, so it shouldn’t be a surprise I support them.

    >Gee Chris; I’m finding that there are many, many in our industry that seem to have the same definition of cloaking as myself.

    Quantity != quality. How many people believe outgoing links help, links from .edu are better, or an over-optimization penalty caused the problems after the Florida update? And how many of the people supporting the claim that cloaking… or excuse me, “IP based content delivery to search engines”… is okay depending on your motivation are doing so purely because they do it themselves and want to keep the practice possible? Not everyone wears a white hat.

    >Yes Bill; I didn’t think Matt claimed WMW was cloaking, though I know the article kind of reads as if it’s claiming as much. I also believe the Google definition of cloaking is exactly my definition as well.

    Google is using your definition huh?

    In this post Matt basically says that not showing the content that generated the SERP position is cloaking. Not in so many words I know, but he says that there was an allegation that WMW was doing the above. That if he tested it and it met the definition of cloaking they would be penalized, and that it seems Brett made some changes because now most visitors are seeing the content so they’re in the clear (so far). If he were using your definition he would have no need to investigate as the alleged activity wouldn’t be against their guidelines.

    Then of course there is the horse’s mouth:
    http://www.google.com/support/webmasters/bin/answer.py?answer=35769
    >Make pages for users, not for search engines. Don’t deceive your users or present different content to search engines than you display to users, which is commonly referred to as “cloaking.”

    Oh, and is the WSJ really cloaking? I know they require registration but I wasn’t aware of them cloaking. I just did a cursory check and any page listed in the SERPs from WSJ that results in a login page once clicking through shows no abstract, which I take to mean Google has indexed the content-less login page and not the article page I’d see if I logged in.

    But if you have an example by all means post it.

  94. Interesting to see the many definitions of the issue of ‘cloaking’.

    I’ve stumbled onto WMW’s registration page too many times. And it really bugged me. Why didn’t I get to see the content that Google saw when the spider came by?

    Some people are stating that feeding content that isn’t visible to guests but is visible to bots isn’t cloaking, because a registered user sees the same content. Then I’m thinking: how would a spider log in? It doesn’t, so it could never see the content that way.

    If there’s content behind a login, it shouldn’t be shown in the results. I don’t need a dozen accounts to get the information I’m seeking. I need the information.

    So putting content behind a login, but giving a spider a free doorway to get it indexed, is in my opinion cloaking. I’m getting a different page than the one the bot indexed.

    I’m also using the noarchive tag, but that’s because my content is all subject to removal, and if it’s removed, it’s gone for a reason. And since SEs don’t update that often, I don’t need a cached version of a removed page.

    The issue with the newspapers that came up in some comments is understandable. News on these sites is free, but after a few days the old news gets archived, yet remains viewable in the cache. In Belgium there was a lawsuit on that issue: content that was no longer free after a few days and required a paid account was still visible in Google’s cache.
    Instead of simply adding the noarchive tag, these webmasters started a lawsuit, and won it too.

    Most people are really annoyed when they click a link from a search engine and get a registration/login page rather than the content. If that’s how a site works, webmasters should block the spiders as well as guests/unregistered users. Otherwise I see it as cloaking: feeding bots content that a guest can’t see. Bots don’t register or log in.

  95. WMW robots.txt file says

    User-agent: *
    Disallow: /

    Kenki,

    Are you a robot?

    Why would you expect to be told you’re permitted to crawl WMW?

    That’s called a dynamic robots.txt file: only those robots allowed to crawl see the real robots.txt file, and everyone else, including yourself, is told NO ROBOTS ALLOWED TO CRAWL without explicit permission.

  96. >>In this post Matt basically says that not showing the content that generated the SERP position is cloaking. Not in so many words I know, but he says that there was an allegation that WMW was doing the above.

    Not in so many words, but also not at all. Matt did not say Brett was cloaking. If Brett were indeed cloaking, then Google would remove the pages. WMW is not cloaking. If those pages are removed, then ALL pages implementing the same thing should be removed as well. In other words, almost all the good content on the internet would have to be removed from the Google SERPs. How is that good for Google’s users? It’s not.

    It’s up to Google to show all pages that are relevant to a search. It’s not Google’s job to disallow all content just because users who are not registered cannot view the page without registering. Would it be a better search experience for the unregistered user to just “never” find that good content? I don’t think so.

    The way Google has it now, it’s giving their users a “choice”. The choice is whether or not to register to view the content. They don’t have to register. They can hit the back button and click on another page in the SERPs. At least they DO get the choice. Isn’t this what Google should want? I would think Google wants their users to be happy campers… not naive campers who never find the good content in the Google SERPs. As Adam stated, the users have a choice.

    WMW is not cloaking. WSJ is not cloaking. Doug Heil is not cloaking when he puts his private forum into Google’s index. No site is cloaking when it allows all bots and all users without flash installed to view the html page and sends users with flash installed to the pretty flash page. Not cloaking. ALL are forms of content delivery. If I am detecting the IP address and sending users who live in Poland to the Polish based section of my site, and sending users who live in Mexico to the Mexican section of my site, and sending users who reside in California….(which includes Googlebot) to the English section of my site, it’s not cloaking. It’s simply a form of content delivery.

    Cloaking is always search engine spam.

  97. So you’re saying WMW are actually cloaking their robots.txt file to show different content to the user as to the robots…[snip]… but at the same time it is still cloaking.

    Um, NO… I said WMW is showing spiders authorized to crawl THEIR real robots.txt, and everyone else, humans, spiders, and bots alike that are NOT authorized to crawl, is shown the KEEP OUT version of the file. This isn’t cloaking, this is just showing the proper content of the robots.txt file on a need-to-know basis as a security measure.

    Why?

    Because scrapers can look into your robots.txt file, see who’s allowed to crawl, change their user agent to match, and get into places where they aren’t invited.

    BTW, the WMW robots.txt Perl code is publicly available and located here:
    http://www.webmasterworld.com/robots.txt?view=producecode

    Note it only allows slurp, msnbot, jeeves and googlebot. If you have anything else you want to crawl your site you’ll need to modify that code.
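    (For anyone who doesn’t want to read the Perl, the general idea can be sketched in a few lines of Python; this is not the actual WMW code, and the allow list simply mirrors the bots named above. A spoofed user agent would still get the open rules, which is why reverse-DNS verification is often layered on top.)

    # Sketch of a dynamic robots.txt: recognized crawlers get the real rules,
    # everyone else is told the whole site is off limits.
    from flask import Flask, request, Response

    app = Flask(__name__)

    ALLOWED_CRAWLERS = ("googlebot", "slurp", "msnbot", "jeeves")  # mirrors the list above

    OPEN_RULES = "User-agent: *\nDisallow: /private/\n"
    CLOSED_RULES = "User-agent: *\nDisallow: /\n"

    @app.route("/robots.txt")
    def robots():
        ua = request.headers.get("User-Agent", "").lower()
        allowed = any(bot in ua for bot in ALLOWED_CRAWLERS)
        return Response(OPEN_RULES if allowed else CLOSED_RULES, mimetype="text/plain")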

    You keep saying it’s cloaking. It’s not cloaking. LOL Is Google or Yahoo or MSN being deceived by his robots.txt file? Nope. It’s not cloaking, so please don’t call it that. Thanks!

    I don’t think Brett really cares about “new” or other search engines. Neither would I. He will simply modify his code when he wants another bot to crawl and index his site. Easy stuff.

  99. hmm, hey Matt,

    Would a minor amount of “cloaking” get a site banned from the Google index? The reason I ask is that I had a site disappear from the Google index in January, as I had displayed 2 extra Google AdSense adverts if the user’s referrer was a search engine. I had asked the Google AdSense people if this sort of thing was okay, and the person who replied said it was fine as long as I didn’t edit the actual AdSense code itself. Now I’m thinking it probably was this “cloaking”, even though everything minus the extra ads is normally seen by all users. I had no other explanation as to why the site had been banned, but this now makes some sense. If you could let me know whether this is actually a bad thing, please do, as I assumed it wasn’t really cloaking, especially after getting the go-ahead from the AdSense staff, but then again, you guys do all have different rules, etc. The site that got banned was this one.

    Thanks 🙂

    John

  100. I guess since the SEO reputation meme has had its twice yearly outing, we can get back to the endless useless insane debate about cloaking. Then maybe we can get back to if you should have spaces after elements of a meta tag. Oh, yeah, I actually DID have to get back to that last week.

    Enough. Seriously, enough. Do we have to do this every year or two? Does Doug have to start screaming about how cloaking is always “bad,” then someone else have to say oh no, that’s not cloaking, and someone else say oh yes it is?

    It’s like we go nowhere. How much time are we yet again wasting on this topic, all frankly because of Google itself?

    Matt, Google especially painted itself into the entire “cloaking is bad” corner then didn’t want to walk across the wet paint as you started allowing people with registration-based content officially into the system.

    Brett — I love you, but honestly, those registration pages that come and go based on some complex combination of factors also can be confusing and make people think they either need to pay for or sometimes register for content that is free. If they were less confusing, you’d have fewer people poking at you.

    Marshall — I know you’re out there, what’s the deal with the NYT pages doing the potential cloaking dance again.

    Doug — nah, Doug, don’t even get me going.

    Two fixes here — and if I cared, I’d roll out links where I’ve written about them from like 2001, 2002, 2003, 2004 and so on.

    First, Matt — enough. If your guidelines don’t already say it, get them updated to say that “In SOME cases if there’s cloaking — especially cloaking that’s not authorized — Google might ban a page or sites.”

    There are times when Google wants to allow cloaking. No, Doug — cloaking is NOT always bad. There are times when a search engine may be perfectly happy showing a user something that is different than what a spider saw. I’ll leave it to Google to decide when this is the case; the right qualifiers immediately defuse the entire “they’re cheating — you aren’t being fair” stuff that continues to come up.

    Second, people have been saying more and more they dislike seeing content show up in search results if registration is required. Skip the “is that cloaking or not” argument here (I think I ended up agreeing it wasn’t last time we had this go around). It’s damn annoying. I ran a subscription site once. I would have loved an easy mechanism to feed those pages to Google without giving away the content to users. So let the registration people in. And then make sure those types of pages are clearly identified. And let searchers choose to disable seeing these pages in the search results, if they want.

    Matt — I’m begging here. Don’t let this end as another case-by-case type of thing. Another argument over cloaking, covering the same issues we’ve had for years, it’s doing my head in.

  101. BTW, you got it wrong Danny, Doug isn’t yelling about cloaking being bad, Doug says Brett IS NOT cloaking and has done nothing wrong.

    Wow, how did I miss a shout out from Danny in THIS thread?

    Ah, he glossed over what Doug actually said so maybe he missed my tirade as well 😉

  102. I understand Doug perfectly. We’ve had this argument many times.

    Doug sez cloaking is bad. All cloaking, even if search engines allow it.

    Doug sez what Brett is doing is NOT cloaking. Same for the NYT. I think I agreed with him on that, last time we did this dance.

    Some disagree with Doug and say what Brett and the NYT are doing is cloaking.

    I say stop arguing about the tactics (is cloaking bad?) and look at intent (is what Brett or the NYT is doing bad for the user? And if Google thinks it’s fine, who cares if it is cloaking or not cloaking?).

    But since I’m all into it now, I’ll do a big history lesson on Search Engine Land shortly. Sigh. Honestly, when I get it up, you’ll see there’s nothing, and I mean nothing, that hasn’t been discussed and suggested before.

  103. I’m starting to think some scrapers are in this thread based on a couple of comments. Looks like they would like to see any website security go away, such as dynamic robots.txt, that slows scrapers down.

    BTW, Danny gave me an interesting idea! How about a SUBSCRIPTION meta tag Matt? That way you can flag content in the SERPS that requires login for free or paid subscriptions and then all the rioters from Cloakville can go home and put their torches and pitchforks away.

  104. I could swear I’ve seen something in one of the SERPS, I think it’s from Google News that identifies subscription sites with (subscription) next to the attribution letting the user know that they are going to get smacked with a registration screen. The same should apply to regular search results.

  105. That’s a Danny’s really tired and shouldn’t even be up right now silence.

    I don’t know of other sites that cloak robots.txt. But I did write about this when Brett started:

    http://blog.searchenginewatch.com/blog/051222-113443

    Look close at that file, and you’ll see that it seems to still ban all the robots. Now look here at what the robots.txt file tells you is the “real” robots.txt file. That’s made real to the major search spiders through this code, which checks to see if a spider is reporting a useragent from any major search engines. If so, then a cloaked robots.txt file is sent to them.

    Cloaked! Cloaked! You mean Google and gang are all anti-cloaking but they don’t mind this cloaking? Apparently so, and not that surprising. The robots.txt file really isn’t designed to be read by humans, though they can.

    So I didn’t really see that as a big deal.

    FYI, if you want some fresh recap on cloaking, subscription content and all that, I have now compiled a bunch of links and stuff going back to 2003:
    http://searchengineland.com/070304-231603.php

    And now I got to sleep to dream of a world where cloaking once again means something you do in the Star Trek universe, unless you’re the Federation, where that treaty of something or another prevents it except for the case of the Defiant, which got all blowed up like Sgt Hulka in the end as well.

    Army training sir!

  106. keniki – I remember last August/September del.icio.us doing it, and there was some talk about it, but they do not appear to be doing it anymore…

  107. Hi,

    I don’t want to sound stupid in front of the whole world, just wanted to ask you something personal.

    I am an SEO guy and my niche is travel-related stuff, so naturally the site I look up to is citysearch.com, and I follow them pretty closely.

    I had noticed that their homepage was optimized for the keyword “city guides”, so I checked the cache and it showed me the homepage, but when I clicked on the URL I was redirected to chicago.citysearch.com. Isn’t this a violation of the same cloaking thing?

    Last but not least, thank you for such a blog; I never thought I would ever be able to talk/write to such a big guy personally.

  108. Matt, I use Firefox with NoScript and I often see pages with JavaScript redirects. I have also seen pages which have keyword-stuffed text at the top of the page and the real content below, with JavaScript used to hide the keyword-stuffed junk.

  109. The objective here is simple. If a user types a search phrase into the search engine and finds a highly relevant listing, when that listing is clicked on there is a reasonable expectation that the user will arrive at the content that was “promised” in the search engine listings.

    Anything less is both unacceptable and a major flaw on the part of the search engine serving the content… and this has already been established ad nauseam by the search engines themselves.

    WMW is definitely going against the grain, and it baffles me that this practice is being allowed to continue. Seems a bit like double standards!

  110. Why not just scroll up and read what Matt wrote?

    When I get a chance to tackle Philipp’s most recent report, I’ll be looking at consistency: when a Google user clicks on a search result at Google, they should always see the same page that Googlebot saw.

    That’s it. You click from the SERP to the page. If you see something other than what the bot saw, that’s bad. Matt didn’t mention that seeing something different is acceptable if there are circumstances under which that wouldn’t happen, or some user action is required.

    I call that cloaking. It’s checking to see that a user-agent is a search engine spider and then showing it something because it’s a search engine spider. The fact that what is being delivered to the bot is also being delivered to some humans who have already met certain criteria, doesn’t change that. You don’t have to call it cloaking, but it is showing the bot something that the general public doesn’t see.

    Let’s say I’m doing a survey of roads to see which charge tolls and which don’t. Many of the toll roads allow emergency vehicles through without charging that toll. Now, if some of the roads are aware of what I’m doing and when they see me treat me like I’m driving an emergency vehicle, then my data is going to indicate that the road doesn’t charge a toll.

    When my readers take those roads and reach a toll booth, they’re going to wonder why my data was wrong.

  111. Hey everybody, just a reminder to stay polite/considerate in the comments please.

    Judging from the amount of interest, I may ask someone (Vanessa, Adam, or maybe someone else) to look into doing a webmaster blog post about best practices for publishers.

  112. I hope you take Danny’s suggestions about the “Registration Only” sites to heart. There seems to be universal agreement about how clicking on those types of sites without foreknowledge of how useless they are is extremely annoying and detrimental to the user experience.

  113. Matt, the same issue applies to all the recognized world newspapers: the Times, CNN, nytimes.com, etc. Most of their old articles are indexed and old enough, but most of those articles sit behind a subscription fee after one week.

    Some newspapers are using this to convert subscribers even when the content is not what users want.

  114. Not all cloaking is “bad” or “black hat”. UA/IP based content delivery of any kind is “cloaking” in its technical meaning. Providing keyword- and description meta data is not spammy just because some spammers abuse meta elements. Perhaps we just need another term to discuss the “dark side” of particular techniques widely used by Web developers? BS. It’s a question of intent, not usage. Think of search engine friendly cloaking:
    http://www.smart-it-consulting.com/article.htm?node=148&page=103

    Sebastian

  115. >>Doug sez cloaking is bad. All cloaking, even if search engines allow it.

    No Danny, sorry; I didn’t say that. Re-read what I’ve written in here, and quote and reply to exactly what I wrote. Thanks!

    Cloaking is always search engine spam. That much I can say.

  116. I missed this from “Mr. Sullivan”;
    >>There are times when Google wants to allow cloaking. No, Doug — cloaking is NOT always bad. There are times when a search engine may be perfectly happy showing a user something that is different than what a spider saw. I’ll leave it to Google to decide when this is the case, the right qualifiers immediately defuse the entire “they’re cheating — you aren’t being fair” stuff that continues to come up.

    Google does not “allow” cloaking Danny. Subscription pages are Not cloaking. If a user is subscribed and clicks on the wmw link, he/she gets the same darn page a bot gets. How can it be cloaking when both get the same page?? It can’t be.

    Heck; we all cloak. Google cloaks. I cloak. We all cloak. Let’s just call “all forms” of content delivery cloaking. That would make things less confusing, right?

    If you are detecting the bot “only” and showing a page that NO other user agent gets, then you are cloaking and you are spamming the search engines.

    Brett is “not” cloaking. Google does not “allow” cloaking.

    We all agree that the way the SERP listings appear should be changed, and is not good for users. We just don’t all agree about the way WMW is getting there. There is nothing wrong with the way they are getting there, but something needs to be done about the listing that the user sees in the SERPs.

  117. What seems to be missing from this discussion is the issue of scalability. Google is for the entire internet, not just SEOs searching for things that happen to be in WMW. You can all agree/disagree that showing a log-in page is cloaking/not-cloaking; as I said before, it’s irrelevant. What if the log-in page was used as a hard sell for pron or the aryan brotherhood? What if they’re using those names from the free sign-ups in not such a friendly way as rogue bot control?

    With some issues it’s sometimes easier to understand if you take both positions to the extreme and examine the possible consequences. The first extreme is if Google does not index any pages at all that utilize a sign-in requirement; the possible outcome is that the user gets the information that was shown in the snippet in one click. The other extreme is that Google doesn’t mind, and in fact all websites eventually adopt this method of content delivery, and every single search query requires the user to establish an account and log in (for free or otherwise) to see the content. I’d say the former is preferable to the latter.

    Now, one could argue for a rational happy medium between the extremes, where some sites are allowed to and others are not. Without transparent disclosure this is a slippery slope that I’m sure Google would be wary of, as it opens up a ton of ethical questions: Is this based on paying off the right person? Knowing the right person? Or on personal tastes, is WMW more worthy than some Hello Kitty fan club? Will unpopular subjects be treated as fairly as popular ones? Will the DNC be allowed to have a sign-in while the RNC is not? How can I apply for this preferred status? Is it fair to have 99.999% of the sign-in page be an upsell to a paid subscription and at the very very bottom have a free sign-in? Are there standards for what is done with those names collected? Is Google going to police the usage of the user lists generated at its expense? If I sign up for a site that Google sent me to, are they going to guarantee my online safety? Indexing and ranking a site based on signed-in content sure looks like an endorsement to me.

    Another thing to consider is that the searcher is Google’s client, not the webmaster. If the searcher is unhappy with the page they land on, that’s an unhappy Google client. It doesn’t matter that the actual content was only a sign-in page away, if they clicked that link expecting to see the result for their query and didn’t get it, they may move on to another search engine.

  118. hmmm
    i don’t really care if it is cloaking or not,
    when i click on a link i want to see the indexed content displayed
    anything else is annoying and i would prefer search engines not to show it or to require me to click ‘show results with indirect/hidden content’ before they show on my results list
    – imma (random user)

  119. Hey Doug!

    A debate! On cloaking! Like old times! Maybe we shouldn’t change anything, what would we have to talk about? Boiled peanuts? Seriously, if you can score me some canned ones from South Carolina, we might have to hang out together. Been a long long time and Scottie consistently fails to deliver.

    > Cloaking is always search engine spam. That much I can say.

    Yeah, I said that. I said that cloaking is bad, that cloaking is evil, that to you cloaking is search engine spam. I firmly agree that the quote above is your viewpoint. I don’t know what you’ll ever do if some search engines say it’s not. Wait a minute: if it’s spam, how come Google doesn’t say you WILL be banned for cloaking but rather says you MAY be banned? Perhaps because they don’t always enforce spam guidelines if they don’t feel the intent is harmful to the search results.

    > Google does not “allow” cloaking Danny. Subscription pages are Not cloaking.

    Yeah, wasn’t talking about subscription pages, Doug. I’m talking about other cases where Google will come across pages they know are actually and honestly cloaking but decide not to worry about it.

    I totally agree with you, though. Showing the spider something that’s the same that you can ultimately see if you pay or register isn’t cloaking. Go read that article I just wrote. Completely on board with you there. We’re peas in a pod, two boiled peanuts in a shell and soaked in brine. Yum, love to suck that brine out of the shell.

    Given this, hey — Brett’s not doing any cloaking activity. And given that, why’s Matt kind of putting him on notice here, if he’s not cloaking. Because, Doug — while you and I might agree this isn’t cloaking, it’s still not what Google really wants. Google obviously does see it as cloaking, unless Google’s given permission for it to happen.

    I mean, this is a post Matt called “a quick word about cloaking.” And he says:

    “I won’t consider this issue closed until I have the time to investigate how consistently the return-the-same-content-as-Googlebot-saw behavior happens;”

    But Googlebot IS seeing what any user can see, if they register or pay. So that’s not cloaking, right? Clearly Matt thinks showing a chunk of people a registration page instead of the actual content is. Otherwise, he wouldn’t have bothered to do this post at all.

    > Let’s just call “all forms” of content delivery cloaking. That would make things less confusing, right?

    No. Let’s finally get off the entire focus on the technical aspect of what’s going on and look at intent. Let’s stop having to ticky-tack and debate whether cloaking is IP delivery no it’s agent based no it’s if you show country specific things blah blah blah blah blah.

    Anyway, got better things to do than spend yet more time on this issue for like the fifth year in a row. I’ve put out the solutions as I’ve seen them. We agree on the labeling search results front. We’ll never agree on the true cloaking is always bad front. I will agree that most of the time, people doing true cloaking are likely to get in trouble, and it’s not something they should do — primarily because they are cloaking to hide poor content rather than actually work on improving their sites. But while you’re happy to shove your head in the sand and pretend that search engines never ever ever ever ever allow real honest cloaking, I know better. Plenty of other people do as well. And I don’t have a problem with that, because it’s their search engine — they’ll look at intent and decide what they want.

  120. It seems we agree on more things than I thought then.

    >>Wait a minute, if it’s spam, how come Google doesn’t say you WILL be banned for cloaking but rather says you MAY be banned.

    They say “may” as they do not and cannot discover all pages that are spamming them; they will eventually, as all spam is at risk of being caught. They can’t say “will” be banned, as that would be confusing to someone who discovers spam in the index. BUT: They could say this:

    “We do not allow cloaking. If we discover that your page is cloaking, we will ban you. We can discover this through automatic means or through manual reviews. Do not cloak.”

    They could easily state that. They are not right now, but they could. Matter of fact; Google should change the wording now.

    >>Doug. I’m talking about other cases where Google will come across pages they know are actually and honestly cloaking but decide not to worry about it.

    Not sure what you mean here. If you are saying that Matt or anyone at Google will manually review a site, and then turn their heads the other way and allow the cloaking to go on… I disagree with you.

    If you are saying that their automatic ways might not detect the cloaking anytime soon…. I agree with you. “come across” can mean different things. Please be more clear about it.

    >>Given this, hey — Brett’s not doing any cloaking activity. And given that, why’s Matt kind of putting him on notice here, if he’s not cloaking. Because, Doug — while you and I might agree this isn’t cloaking, it’s still not what Google really wants. Google obviously does see it as cloaking, unless Google’s given permission for it to happen.

    Matt is not putting Brett on notice. Matt is responding to an article/blog by Philipp. I don’t think the practice itself is what bothers Google; I think the way the listing appears is what they really don’t want. I don’t think they want to deny all content behind a login screen either. To deny all of that content would not be good for the results and for their users. Your last sentence doesn’t make sense though. You state that WMW is not cloaking and say it’s not what Google really wants, and then say Google obviously sees it as cloaking unless permission is granted. That’s waay too confusing. 🙂 If Google says he “is” cloaking, then the listing will be removed. I know Matt does not think it’s cloaking at all. He just doesn’t like the way it is now and will probably do something about the listing with some kind of disclaimer.

    Matt was responding to an article about WMW. He’s being kind and acknowledging there is a problem and wants to fix it. He’s never stated WMW is cloaking.

  121. No. Let’s finally get off the entire focus on the technical aspect of what’s going on and look at intent. Let’s stop having to ticky-tack and debate whether cloaking is IP delivery no it’s agent based no it’s if you show country specific things blah blah blah blah blah.

    If we’re going to look at intent, we’re not going to settle anything. If Googlebot sees a page of useful information and I see a page that tells me that I can get a free 2-week trial if I fill out a form, was the webmaster’s intent to make sure I’m aware of this useful content because it’s going to enrich my life and they wouldn’t want me to miss out just because it requires a subscription, or was it to use that content to get me to sign up, thus giving them information they can use to market other things to me, and perhaps get me to buy a subscription?

    The former is a noble effort to give me access to information; the latter is a slick attempt to get something out of me.

    Of course, it’s up to Google to decide what the intent was. Does that mean they vote on it, or is it all up to whatever engineer happens to come across it first? Can Matt decide it’s ok, but then get overruled by someone higher up? If I get banned, do I get a chance to prove to them that my motives were pure?

  122. Yes, intent is measured by the search engines themselves.

    Yes, Matt can decide it’s OK.

    Yes, someone higher than Matt could override him. Of course, Matt’s about as high as you get in making these exact types of decisions. I suppose Larry and Sergey might walk in and slap him around. Otherwise, he’s pretty much the guy who does the overruling, if that needs to happen. Haven’t you ever seen his slapping hand?

    Yes, site owners get a chance to prove they were pure. It’s called a reinclusion request. The court is run by Google, of course. But it’s also their search engine.

  123. Matt,
    I think some of the confusion might also stem from Google News’ explicit support of cloaking, or at least what some would call cloaking:
    http://www.google.com/support/news_pub/bin/answer.py?answer=40543&topic=8871

    In order to include your news articles in Google News, our crawler needs to be able to access the content on your site. Currently, crawlers can’t fill out registration forms, nor do they support cookies. Given that, we need to be able to circumvent your registration page in order to successfully crawl your site.

    The easiest way to do this is to configure your webservers to not serve the registration page to our crawlers (when the User-Agent is “Googlebot”).

    While I understand the intent, telling people to serve different content to Googlebot (for Google News purposes) is the exact opposite of what you say in the Google Webmaster Guidelines.
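
    In practical terms, what that help page describes is a bare User-Agent check; here’s a toy sketch of the idea (my own illustration, not code Google publishes; the function name and return values are made up):

    # Toy sketch of the Google News advice quoted above: serve the article to
    # Googlebot (and to signed-in users), and the registration page to everyone
    # else. The helper name and return strings are invented for illustration.
    def pick_response(user_agent, signed_in):
        if "Googlebot" in user_agent or signed_in:
            return "full article"
        return "registration page"

    print(pick_response("Googlebot/2.1 (+http://www.google.com/bot.html)", False))  # full article
    print(pick_response("Mozilla/4.0 (compatible; MSIE 6.0)", True))                # full article
    print(pick_response("Mozilla/4.0 (compatible; MSIE 6.0)", False))               # registration page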

    Also, Danny said:
    >>I would have loved an easy mechanism to feed those pages to Google without giving away the content to users. So let the registration people in. And then make sure those types of pages are clearly identified.

    This mechanism already exists for the Google News Archive. Expanding that to the general index could really solve some of this…

  124. >>Google does not “allow” cloaking Danny. Subscription pages are Not cloaking. If a user is subscribed and clicks on the wmw link, he/she gets the same darn page a bot gets. How can it be cloaking when both get the same page?? It can’t be.

    Mmm Doug, I have a couple of questions for you based on the above comment.

    Situation: I make some really content-rich pages all about viagra, and I serve these to googlebot. However, when anybody that is not a bot comes to the page I show them a page stuffed with AdSense and other adverts. On that page I also add a nice little box, maybe at the bottom, maybe quite small, but it’s definitely there – called ‘Register to see the full content’.

    Once registered they see the exact same content I showed to googlebot.

    Q: Is this cloaking? Is this Spam?

  125. Doug:

    The answer to the member’s question about WSJ is that it is “not” cloaking as it’s basically the same thing. They are showing all user agents who are “not” subscribed the sign-in page. They are showing all bots and users who are subscribed the content page. Not cloaking.

    Well, I’m looking at this more as an “Is this good for the user” issue and in light of two other issues.

    One is Matt’s comment in his initial post:

    When I get a chance to tackle Philipp’s most recent report, I’ll be looking at consistency: when a Google user clicks on a search result at Google, they should always see the same page that Googlebot saw.

    Two would be the fact that Google already has a mechanism for sites like the WSJ to get premium content into the Google News results – the First Click Free program. That’s where this otherwise inaccessible content belongs, not the regular SERPs.

    Chris:

    Oh, and is the WSJ really cloaking? I know they require registration but I wasn’t aware of them cloaking. I just did a cursory check and any page listed in the SERPs from WSJ that results in a login page once clicking through shows no abstract, which I take to mean Google has indexed the content-less login page and not the article page I’d see if I logged in.

    Again, I’m not getting into the cloaking vs. content delivery debate, these are just my thoughts from a user’s point of view.

    Here’s a simple search:

    http://www.google.com/search?hl=en&q=george+bush+site%3Awsj.com&btnG=Google+Search

    It returns results from two WSJ subdomains, blogs dot and online dot. Pages from the blogs dot sub are accessible. Clicks on pages from the online dot sub bring me to a sign-in/subscribe page:

    http://users1.wsj.com/lmda/do/checkLogin?mg=wsj-interactive&url=http%3A%2F%2Finteractive.wsj.com%2Fdocuments%2Faugpoll.htm

    Since I am not a WSJ subscriber I must click on the “Subscribe Now” link. That brings me to a four-page subscription routine where I have to whip out my credit card.

    Paid content, no notice. Is that a good user experience?

  126. I could use some of those peanuts too, actually. It’s getting harder and harder to buy anything like that around here because half the kids are walking around with Epipens and allergic to them.

    Come on, Doug, score me some. I’ll email you my mailing address if you promise to. 😉

    People keep overlooking that this issue only manifests for SOME of the WMW visitors, not ALL; it’s a security issue, not a cloak, for all intents and purposes.

    What if you just issue a 403 to everyone in bad hoods where you get most of the botnet attacks, yet the rest of the world can view the site freely. Is THAT cloaking too since Google shows them a page but the website says FORBIDDEN? Technically, 403 is no content, but it’s still different than what Google shows as they see a broken site.
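
    (Mechanically, that is nothing more than an IP check in front of the page. A rough sketch, with invented documentation ranges standing in for the “bad hoods”, just to show there’s no deception involved:)

    from ipaddress import ip_address, ip_network

    # Rough sketch of the "403 to bad neighbourhoods" idea. The blocked ranges
    # below are invented example addresses, not anyone's real blocklist.
    BAD_HOODS = [ip_network("203.0.113.0/24"), ip_network("198.51.100.0/24")]

    def status_for(client_ip):
        if any(ip_address(client_ip) in net for net in BAD_HOODS):
            return 403   # Forbidden: no content at all, not a different page
        return 200       # everyone else gets the normal page

    print(status_for("203.0.113.17"))  # 403
    print(status_for("192.0.2.10"))    # 200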

    Does this mean my Great Firewall of China to protect my server is against the rules because my content shows up in Google.cn?

    If I open up the firewall and put in a login to allow humans but slow the bots down coming from Asia am I then breaking the rules?

    I think a login as a solution is much less forbidding, pun intended, than a 403 or than blocking entire countries outright. However, it appears how webmasters secure their sites is in jeopardy, and we should just all roll over and submit to whatever we’re told to do while our servers go down in flames, our content is stolen and all over the SERPs, and worse.

    Here’s an analogy to ponder:

    Google is just a phone book listing businesses on the web, as opposed to the Yellow Pages for the real world.

    If I go to your real world place of business from the Yellow Pages:

    – A jewelry store might be locked and you have to be buzzed in to gain access, isn’t that the same as a login?

    – Show up at Google’s HQ and you have to register at the front desk.

    – Go to a ritzy night club and the bouncer decides if you can get in or not.

    In all of these cases, the Yellow Pages (aka Google) tells you where these places are and what content they have, yet you’re not always going to be allowed to get inside the minute you walk up.

    So why are online businesses being told that to be in Google means you can’t have the same right to refuse service to anyone for any reason, or to require an ID tag (login) to get inside, just like at Google HQ in Mtn View?

    Hopefully the concept of the right to refuse to serve someone (or a whole country), or the right to refuse to serve with conditions, isn’t being lost here as I’m sick and tired of losing rights in how I run my business.

  128. >>So why are online businesses being dictated that to be in Google means you can’t have the same right to refuse service to anyone for any reason,

    They aren’t – they seem to be being told that they are welcome in the SERPs as long as a single click from those SERPs takes the visitor to the same content that caused the result to show in the SERPs in the first place.

    Shades of ‘cloaking’ based on security or any other reason, is just going to open a can of worms. See my example a few posts above – is that worse than WMW? Should that be denied, whilst WMW allowed?

  129. Shades of ‘cloaking’ based on security or any other reason, is just going to open a can of worms. See my example a few posts above – is that worse than WMW? Should that be denied, whilst WMW allowed?

    OK, so you’re saying it’s better to be wide open to scrapers that Google incentivizes with AdSense?

    If the scrapers can get to ONE PAGE thanks to arm twisting by Google, using proxy servers, especially Tor proxy servers, they’ll get to all your pages and start making a profit off your hard work.

    Calling site security cloaking opens a WAY bigger can of worms, one that I’ve been fighting for years now. Once we’re forced to allow those bad places to see ONE PAGE, we’ll be chasing scrapers and filing DMCA complaints, lawsuits and worse on a regular basis again.

    Why should Google or any SE be permitted to put us in that precarious situation when there is NO INTENT to fool the user or be deceptive?

  130. This is certainly a major problem. Google “cannot” deny sites just because they want to protect themselves, nor can they deny content for the same reasons, and many more reasons, most of which have to do with giving their users a choice.

    I feel there is really only one option here and that option is detailed in this thread.

    Good analogy on the brick and mortar stores Bill.

  131. OK, so you’re saying it’s better to be wide open to scrapers that Google incentives with AdSense?

    No, I am suggesting that Google should make it clear what type of cloaking is allowed – and that should probably be as mentioned above – perhaps a cloaked page can display nothing more than a CAPTCHA gate to the real content – with a brief, Google / SE approved statement explaining why the visitor has seen that page, rather than the content they thought they were going to see.

    With something ‘standard’, users will, in time, understand the process and it won’t be a pita.

    Adding registration forms, adverts, links to other sites / pages etc etc is cloaking the page for benefits other than the ‘security’ / ‘bot’ reasons detailed – and those sites should be removed from the SERPS as per the “don’t cloak” guidelines printed many times above.

  132. I want my flying car that was promised to me 20 years ago! Or at least the subscription based indexing Adam mentions here in November:

    http://groups.google.com/group/Google_Webmaster_Help-Indexing/msg/f5ad66d26faa166d?hl=en&

  133. >>No, I am suggesting that Google should make it clear what type of cloaking is allowed

    So now we have different “types” of cloaking? In other words, all forms of content delivery are called cloaking, right? No cloaking at all should be allowed. However; types of content delivery should be allowed.

    This is exactly what I mean about “confusing”.

  134. Well yes, different ‘types’ of cloaking – as I said.

    No point in going round in circles about definitions though – that has been covered.

    If I want to ‘deliver’ my content to GoogleBot in one way, and ‘deliver’ my content to visitors in a second way (behind an AdSense-stuffed splash page), is that ok, as the same content is ‘delivered’ to both?

    Or is that cloaking?

    As JLH said a few lines above, we are Google’s clients. If Google shows me something in its results, and when I click on it I see a whole other thing, I blame the site, AND Google as well. In fact, it is Google’s “fault”. Why? Simple. Google should tell me upfront if, to see that information, I need to register on that site (paid or not). I understand that forums and other places need to earn money via subscriptions, or want to gain users, or whatever; it is their right! In fact, it may come to be that if there’s an icon stating that I need to pay for a registration to a certain site, I may be more inclined to use it because I am under the presumption that the info displayed will be of higher quality than the free one. It may indeed open up new revenue avenues for a lot of sites out there…

    You stated no point in going in circles so I will not address your question. It’s been addressed many times in this thread and over the years. I will simply say that “cloaking” is a very specific technique used to deliver content that deceives a search engine. It’s not a “broad” thing that encompasses all content delivery and should not be stated as such. It’s in the industry’s very best interest to narrowly define cloaking. If we don’t, then we hear about how everyone cloaks. We are not just trying to make sure “we” understand things; we have to understand that “everyone” is reading and trying to learn. The less confusing we are about definitions, the better off we are.

  137. Hi,

    First, while I have other “issues” with WW, Brett is not cloaking for deception, and most of the content is visible without payment. Brett simply shuts out some bad networks with known troublemakers unless the visitors can identify themselves as bona fide users or SE spiders. The simple test is, bad neighbourhoods require login at WW, and G is not that sort of a neighbourhood.

    Sometime I’d like to chat to Matt if he thinks that *my* site is cloaking. I cut down what I show to bots because (a) they are 90%+ of my load and I find it hard to give them everything that I show a human visitor without running out of resources and (b) partly on the advice of the AdSense optimisation team at G since some of the full page content was confusing the bots and causing mistargeting. And indeed, even a human visitor’s first hit is stripped down to minimise the time to load that all-important first page.

    I don’t change the content to be deceptive, but to cope with the load and improve perceived performance (and ad targeting). I still don’t know if this counts as “cloaking” or not.

    Rgds

    Damon

  138. Well, I think one thing has been established in all of this back and forth.

    Matt, Adam, Vanessa, et al. … no matter which decision you make or what you do with regard to this issue in the future, you guys are totally, completely, 100% screwed. No matter which tack you take, someone is going to be really upset with you.

    Personally, I like Bill’s idea (go figure the angry old man with the beer in hand being the most reasonable and sane one of the lot of us), but with a slight twist:

    User-Agent: *
    Subscription: /

    Two lines in a robots.txt file. That’s it. (Could use the tag too, I suppose, but this might be a lot easier for people to swallow.)

    From there, the bots that pay proper attention to this sort of thing (e.g. big G, Yahoo!, MSN) can read this and say “oh, this site or section of a site is subscription-based. Let’s index it and put an icon or a textual warning beside the results from this site or section saying ‘may require subscription'”.
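
    To make that concrete, the crawler-side parsing for this purely hypothetical directive would only be a few lines; a sketch (no robots.txt spec or search engine supports a “Subscription” field today):

    # Sketch of how a crawler might read the proposed, entirely hypothetical
    # "Subscription:" lines out of robots.txt. Nothing supports this today.
    def subscription_paths(robots_txt):
        paths = []
        for line in robots_txt.splitlines():
            field, _, value = line.partition(":")
            if field.strip().lower() == "subscription":
                paths.append(value.strip())
        return paths

    sample = "User-Agent: *\nSubscription: /"
    print(subscription_paths(sample))  # ['/'] -> flag every URL as "may require subscription"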

    This could also be integrated into personalization of results by allowing users the ability to not see subscription-based results, if they chose to do so. So all the “this is unacceptable” types don’t have to look, and those of us who couldn’t care less or aren’t bothered by it can look if we want.

    It’s easy to implement, it allows those with legitimate intentions (such as Brett’s) to flag their intentions, and everyone gets what they want.

    If the vast majority of us could agree on something like this, this whole thing gets solved a lot more quickly and we can stop stabbing at each other. So…can we? Bill deserves some credit for his idea.

  139. Matt,

    What you explain about WMW and your personal involvement in checking whether it is cloaking or not is not a scalable method to fight spam.

    I am sure your programmers could create scripts that do this type of check with excellent accuracy, and this would keep people from suggesting that your cloaking detection is biased.

    Jean-Luc

    What about sites like acm.org? I always get the user registration page although I found snippets from papers while searching Google.

    Try this: “site:acm.org extrusion prevention”

    Now again, try it with IEEE
    “site:ieee.org extrusion prevention”

    They have all the papers crawled by Googlebot; none of the PDF files appears, of course, when you click on it.

    I think google should give us a link or option through the webmasters tools to report cloakers.

  141. I will simply say that “cloaking” is a very specific technique used to deliver content that deceives a search engine. It’s not a “broad” thing that encompasses all content delivery and should not be stated as such.

    I agree with that. What I don’t agree with is tacking the word “sole” onto it as you did earlier, Doug:

    What I said is that cloaking is “very” specific. It’s detecting a bot for the SOLE reason to rank in the serps, and sending that bot to a page other than Sally would see.

    or

    If you are detecting Googlebot “solely” and intentionally in order to server googlebot one page, and to serve a regular search engine user another page, then yes, that is cloaking and that is search engine spam.

    I don’t see why that caveat is necessary. Let’s try an extremely hypothetical example: If I set up my server to detect Googlebot and to detect you, on your home computer, and only when you’re using Opera, and then I show a particular page to you and to Googlebot, but not to everyone else, then I’m not detecting Googlebot solely. Am I cloaking?

    How is that different from detecting just Googlebot, Slurp, MSNbot, and you?

    How about the bots, you, and users with a saved cookie? Are those enough different users for this not to be cloaking?

    I thought Matt made it perfectly clear when he indicated that if a user clicks through a SERP and sees anything other than what the bot saw, then something is amiss:

    …when a Google user clicks on a search result at Google, they should always see the same page that Googlebot saw.

    He didn’t say that the failure to show any user what the bot saw was “cloaking” per se, but he clearly indicated that that’s not what Google wants, and this is a post about cloaking.

    I would be fine with a “subscription required” notification in the SERP, as long as any site could do this kind of thing. My problem is that it seems to me that the NYT, NPR, and perhaps WMW are getting special treatment.

  142. My problem is that it seems to me that the NYT, NPR, and perhaps WMW are getting special treatment.

    NYT approaches things slightly differently. There are a couple of flavors of access. NYT allows you to click through from the SERPs to the full versions of (about) five articles. After that, for registration-required articles, it takes you to a log-in screen. For paid articles, it gives you a short abstract and asks for bucks.

    So, at least here, if you run across and click NYT in the SERPs, you get exactly what you’re supposed to (well, at least 5 times). It’s very similar to the Google News first click free program.

    On this point I’m still arguing with myself whether or not any “must pay” articles should appear in the regular SERPs. I think I come down on the side that since News has the FCF program then that’s where all of this belongs.

  143. Hi Bob;

    >>How is that different from detecting just Googlebot, Slurp, MSNbot, and you?

    It’s very different. When “our” industry is discussing cloaking, it’s being discussed in relation to spamming a search engine. Cloaking is spamming a search engine. It’s not spamming a search user. Spamming a user is called user spam. Spamming an engine is called search engine spam. Cloaking is when it’s search engine spam. The definition of cloaking is “deception” or “to cloak something”. In our industry that means “cloaking something from a search engine in order to deceive that search engine”.

    If something other than a spider is getting the same page IE: a user subscribed or signed-in, then whatever is being done is not cloaking. Cloaking can only be happening if the search engine is being deceived.

    You can certainly say the user is deceived when they click on the listing in the serp and get directed to a sign-up page, but that is not cloaking as cloaking is always search engine spam.

    If we don’t narrowly define the word for our industry, it simply gets way too confusing as this thread is showing. Everyone is calling everything being done with any site they find out there cloaking. That is just not the case. For our very, very best interests, we have to be very, very specific with the word.

    And yeppers; we do agree that finding those pages in the SERPs is very bad and not wanted. But it’s lots better than never finding the good content at all. All that is needed is some kind of icon/disclaimer, whatever, beside the listing, and this situation/problem would be solved.

  144. If something other than a spider is getting the same page IE: a user subscribed or signed-in, then whatever is being done is not cloaking. Cloaking can only be happening if the search engine is being deceived.

    Who says that if the search engine isn’t the only one being deceived, it’s not actually being deceived? Can’t I lie to you and to Google at the same time? If the search engine doesn’t know that the content it’s seeing is not what it can tell its users they’ll see when they click the listing in the SERP, then it’s being deceived, isn’t it?

    I don’t want to quote that same sentence of Matt’s for a third time, but I’d appreciate it if you’d explain how it fits in with your opinion. Do you think Matt is incorrect?

    NYT approaches things slightly differently. There are a couple of flavors of access. NYT allows you to click through from the SERPs to the full versions of (about) five articles. After that, for registration-required articles, it takes you to a log-in screen. For paid articles, it gives you a short abstract and asks for bucks.

    So, at least here, if you run across and click NYT in the SERPs, you get exactly what you’re supposed to (well, at least 5 times). It’s very similar to the Google News first click free program.

    It’s the paid articles I take issue with. Unless I have either paid or recently signed up for the free trial, there is no way I will have access to that article, no matter what the SERP indicates.

  145. there is no way I will have access to that article, no matter what the SERP indicates

    Just to clarify, the first five are free to view, then it’s either reg or pay, depending on the type of article.

  146. Bob; I’ve had this same discussion with you before. You will never agree with me about this so no use in trying. 🙂

    What is described about the NYT has nothing to do with cloaking and the deception of Google. You are still discussing what the user sees only. We all know what Google sees, including Google. It’s not cloaking.

    Sorry; but a better man/woman than I will have to try to explain this to you more clearly.

    Let’s put it this way; when Bill, g1smd, D Sullivan, and a few others and I in this thread say that WMW is not cloaking, that’s a few minds with expertise on the subject saying “not cloaking”. I cannot put myself in the same vein as “expert” minds on the subject, although I do understand things totally. I know of many others who have not posted who feel the same way including Phil Craven. None are naive people at all.

    We simply have to have a very narrow definition of cloaking. If we don’t, then all forms of delivery all of a sudden are called cloaking. If Googlebot is being detected and shown a page that NO other real person is getting at any time, then that is cloaking.

  149. Even if they’re not ‘cloaking’ they should be removed if they redirect to a login page at any time. Open Content anyone?

  150. My eyes are bleeding after reading the latest barrage of comments.

    Maybe I’ll comment later after the Visine kicks in.

    Let’s put it this way; when Bill, g1smd, D Sullivan, and a few others and I in this thread say that WMW is not cloaking, that’s a few minds with expertise on the subject saying “not cloaking”. I cannot put myself in the same vein as “expert” minds on the subject, although I do understand things totally. I know of many others who have not posted who feel the same way including Phil Craven. None are naive people at all.

    And yet I have the utter audacity to think about it myself and draw my own conclusions instead of accepting the opinions of my betters. I’m awful. I mean, of course it’s a matter of opinion, but that doesn’t give me the right to have my own.

    My last attempt. Look at the sentence of Matt’s I quoted. Do you agree? Does it fit your concept of what is and is not ok?

  152. Dave (Original)

    If WMW is serving up different pages to Google (who don’t have to log in) than *some* users see when clicking the SERP link (who do have to log in), then they are cloaking by Google’s definition.

    However, “cloaking” is only a word. The real issue should be: Google are not doing users any favors by not differentiating these pages for them.

  153. JLH,

    I thought I got those flying cars out to everyone; I’ll send you another 🙂

    Danny Sullivan, this discussion is literally the first time I have not respected what I am hearing from you. Regardless of how many times you have personally been in a debate such as this one, there are thousands who haven’t.

    What would you, any of you, suggest to someone who is trying to make a clean site? If we cannot come up with a definition of “sneaky”, then let us define what is not sneaky. Actually, I will do that and put it on feedthebot.com

    What Bob? I now don’t know what you are asking. You are not clear at all. I’ve explained in extreme detail what my definition of cloaking is. I don’t need to keep on rehashing it over and over. Accept it or don’t accept it, as I really do not care. It’s my opinion and I’m sticking to it. You can have your opinion as well, but I simply don’t understand your opinion as I don’t even know what it is. 🙂 Don’t ask me questions and then reply back to me with yet more questions. Just spell out what you are thinking. You already know what I think.

    keniki; What you are missing is the fact that many of us do not agree with your statement that WMW is cloaking. You keep saying it and saying it, but I don’t accept it.

    And further; I think some names listed in this thread “DO” have something to say about the direction of this industry and what best practices should be. We “have” to work with engines like Google, right?

    One more thing: I keep hearing Phil Craven; are you the Phil Craven that once worked at BCD in Tampa? If so, I am Pat, the long-haired guy (Patrick Sexton); say hello to me if you are that Phil Craven.

  156. Hi feedthebot; No, I’m Doug Heil. Nice to meet you. 😀 Phil Craven is well,…. Phil Craven, and I have no idea if he worked in Tampa. I do believe he’s from England. I could be wrong though. His site is webworkshop.net. Think that’s it.

  157. feedthebot, I’m kind of unclear on what you’re upset about with me. If it’s that I’m tired of this debate — sorry. I am. I understand new people are always coming in. That doesn’t mean you keep debating the wheel each year. At some point, you finally make one and move forward. So my contribution this year is to say can’t we finally move forward? There is nothing new here. Literally nothing new other than we’re debating all the same stuff again. If we want to move ahead, we need the search engines to make some substantial changes to how they treat subscription content.

    Doug, when you say that for “our” industry cloaking is spamming and the definition of cloaking is “deception,” you’re not speaking for me. I’ve agreed with the registration stuff that it is not cloaking. But that also means, as I explained last year, that all those people with IP cloaking software would also then no longer be considered cloaking if they use that software for this specific purpose. Moreover, I joked that anyone could then go out and start IP delivering content to Google while putting up reg screens.

    I joked because while that might not fit our definition of cloaking, it’s probably going to get you in trouble with Google regardless — hence this entire post that Matt’s had to do.

    So when you say: cloaking is spam, but “If something other than a spider is getting the same page IE: a user subscribed or signed-in, then whatever is being done is not cloaking. Cloaking can only be happening if the search engine is being deceived,” you’re totally going to mislead people. Maybe they aren’t doing what you consider cloaking and thus aren’t spamming or “deceiving” the search engines, but right now that would be an incredibly bad idea to do. It’s the sort of thing that gets people blogging about whether you’re, well, cloaking — and Matt — you know, the guy who can ban you — issuing statements that people shouldn’t have to register to see what Googlebot saw if you don’t want to get Google upset.

  158. Dave (Original)

    Danny, “If it’s that I’m tired of this debate — sorry. I am”

    Apparently not.

  159. Yes. I agree with most of your post. However; if we go the other way, then many other forms of content delivery are then called cloaking as well….. as we see in this thread and all over the internet right now.

    My new site is going to have these two main folders:

    ihelpyou.com
    ihelpyou.com/flash/

    The first one is what googlebot and all other se spiders will get, “and” all other real users will get if they do “not” have flash installed. The second one is what all real users will get if they have flash installed. I might even “noarchive” that root folder so it doesn’t show up in the Google cache.

    It doesn’t matter “how” many users do not have Flash. It could be only two of them in the world. What I am doing is not cloaking. It’s just a form of content delivery. Isn’t this kind of the same thing? WMW is allowing googlebot and others signed in to see one page… no matter how few they are, and all others to get the sign-in page.

    If we go with the idea of cloaking being a broader meaning, then where does it end? How confusing does it all get? This is where I’m coming from.

    I certainly agree we can’t allow websites to hide all their content behind log-in screens and still be in the SERPS, but what is the answer then?

    It’s a very tough problem and one that is going to require some thought.

    I see this thread as only being helpful to the masses. It doesn’t matter to me how many straight years the debate has raged on as it’s all good. I think it’s going on about 8 years in a row now. It can only serve to help all people, as it certainly doesn’t hurt anyone.

    Waiting for Matt or Adam or both to chime in with a great and helpful and well thought out post. 🙂

  161. I already did. 😉 Now gimme them peanuts!

  162. In response to this thread and a huge amount of feedback, discussion, phone calls, im’s, sms’s, and even a carrier pigeon (don’t ask), I made a few changes today.

    I took a long look at the current required login list and why those particular sets of ISP domains were there. I just don’t think it is feasible for WebmasterWorld to allow unfettered downloading, scraping, attacks, copyright bots, and DDoS’ing attempts from ISPs such as:

    rr.com, comcast.net, pacbell.net, bellsouth.net, btcentralplus.com, verizon.net, chello.nl, swbell.net, t-dialin.net, bellsouth.net, ibm, saix.net, charter.net, charter.com, btopenworld.com, bellsouth.net, verizon.net

    Those represent the worst of the problems we have seen. High speed cable users who think nothing of unleashing a bot at full speed off a 5meg line with 5-10 simultaneous connections.

    Here is an example from a bot just 5 mins ago (as I type this):

    145 11:09.13 pm Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 203.244.131.17 /profilev4.cgi?action=view&member=stuntdubl
    146 11:09.13 pm Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 203.244.131.17 /profilev4.cgi?action=view&member=shri
    147 11:09.13 pm Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 203.244.131.17 /profilev4.cgi?action=view&member=volatilegx
    148 11:09.13 pm Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 203.244.131.17 /website_technology/
    149 11:09.13 pm Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 203.244.131.17 /content_management/
    150 11:09.13 pm Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 203.244.131.17 /profilev4.cgi?action=view&member=Xoc
    151 11:09.13 pm Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 203.244.131.17 /profilev4.cgi?action=view&member=madmatt69
    152 11:09.13 pm Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 203.244.131.17 /profilev4.cgi?action=view&member=dreamcatcher
    153 11:09.13 pm Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 203.244.131.17 /profilev4.cgi?action=view&member=ferhanz
    154 11:09.13 pm Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 203.244.131.17 /profilev4.cgi?action=view&member=zCat
    155 11:09.13 pm Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 203.244.131.17 /profilev4.cgi?action=view&member=caine
    156 11:09.13 pm Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 203.244.131.17 /profilev4.cgi?action=view&member=rocknbil
    157 11:09.13 pm Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 203.244.131.17 /profilev4.cgi?action=view&member=wheel
    158 11:09.13 pm Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 203.244.131.17 /profilev4.cgi?action=view&member=Komodo_Tale

    That is a handful of lines from well over 1k downloads in ten minutes. That is a specific bot written specifically to scrape WebmasterWorld profiles and collect email addresses. We see at least 1 of those specific profile bots every day.

    Another one caused us to reboot the server 3 times today because of machine saturation due to spidering. (On a normal day with normal load, we rarely get above about 25% max load on the new server.) I have no doubt it was due to this blog entry.

    So, dialing back the required login is not an option to me. Regardless of what some of you guys running little sites think, you cannot sit by and let bots continue to damage your system. Our first duty is to the membership – not the bots – and not the drive-by viewers.

    The other option I considered is starting to post the ip’s, log files, and abuse reports to the ISP’s on the HoneyNet project website. After discussing it with our corp attorneys, they advised me not to do that. I’m going to take that advice.

    I did take a look back at how well Matt’s recommended “first click free” system had been working. I love the whole concept of it, but I don’t think it is working well (as witnessed by some of the comments above). One problem is multi-paged threads that take 2 or more clicks to view in their entirety. Or someone visiting the home page, and then clicking on a link – two clicks. I just don’t think “first click free” is enough. So, I bumped it up to about 4-5 clicks right now. I don’t know if that is low enough to discourage someone cranking up a slow bot, but let’s try that.
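
    (The mechanics of that are nothing fancy: in principle it’s just a per-visitor counter, roughly like the sketch below; the limit and names here are illustrative, not the actual code.)

    # Rough sketch of an N-clicks-free allowance: count pages per visitor
    # (cookie/session id) and require login once the allowance is used up.
    FREE_CLICKS = 5
    clicks_seen = {}   # visitor id -> pages viewed so far

    def needs_login(visitor_id):
        clicks_seen[visitor_id] = clicks_seen.get(visitor_id, 0) + 1
        return clicks_seen[visitor_id] > FREE_CLICKS

    for page in range(7):
        print(page + 1, needs_login("visitor-from-serp"))  # first 5: False, then: True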

    The other thing that I did was switch the screen from the login page to this new error page:
    http://www.WebmasterWorld.com/error.htm

    I thought that was as good of a standard statement page as we will find on the net today. So, if you can’t beat them…join them…

    Thanks
    bt

  163. Those of you who know WMW well know also that Brett doesn’t like to talk about “security” matters in public 🙂

    Just for the record, and for those of you who wish to know more about the “Attack of the Robots” on WMW, I wish to recall a two-part “hot” thread started by Brett in November 2005:

    lets try this for a month or three…last recourse against rogue bots

    Attack of the Robots, Spiders, Crawlers.etc

    Brett, nice to see you making changes – but have you considered simply changing the registration page to a CAPTCHA page (without adverts / other links) – just a straight CAPTCHA to access the content that the SERP result was based on?

  165. Brett, I repeatedly had my site floored by bots and SEs (and multiple idiots opening up *50+* connections at once each so that I could not get in to the server locally) which is why I switched to JSP so that I *could* regulate such mindless destructive traffic.

    As long as the traffic that you let the bot see is nothing that has to be paid for with money or personal info, then IMHO you are not being deceptive or cloaking. I am assuming that you *don’t* let G’s bot access the Supporters’ area for example.

    Rgds

    Damon

  166. Dave (Original)

    IF members’ privacy was the issue, nobody’s email would be visible to anyone, period.

  167. That wouldn’t necessarily stop someone from trying, Dave. I think that’s the bigger issue here.

    Brett: you might want to look at the title of that web page. Granted, your site targets mostly tech-savvy types but a newcomer who hit that might be confused by an “789026y89234234hllkhjp9o70p0 Access error”.

  168. It seems to me that cloaking is based on deception pure and simple. There are valid reasons to use IP detection/delivery to provide or deny select content, many raised in these comments.

    For example, what about a site that requires age identification for legal reasons? Currently, many of these sites can’t be crawled (bots cannot input the age information) without some form of “cloak” or bypass. Using a cloak solution to bypass age registration for the bots is not deception. The content eventually presented in the SERPs is available as long as the age requirement is met. One is presenting the same content, but also fulfilling a legal requirement. I think paid subscription sites are a bit of a different animal, but that is my $0.02.

    The same can be said for different site versions (language, device, technology type). What if I maintain multiple versions of a site (html, xml, wap, flash, etc.) but with the same content source? Is it “wrong” to apply logic to the delivery of the content? This isn’t “cloaking” just a logical content delivery strategy.

    Whatever the definition of “cloaking” is, it would be nice to see it applied consistently so that webmasters had a better sense of what is “OK”. Right now, it seems there is a double standard for sites like NPR, NY Times, academic .edu sites, etc. and everyone else.

  169. Seems like a fair balance Brett, I like how you have the searcher in mind and give more than just the first click for free.

  170. IF members privacy was the issue, nobody’s email would be visible to anyone period.

    Had you looked at WMW’s profile feature you would notice that the email address disclosure is the MEMBER’S choice, which puts the privacy where it belongs: in the hands of the members. If I want the other members of WMW to see my email address, that’s my business.

    However, keeping the spambot harvesters off the site is WMW’s job.

    So what about ‘link cloaking’? Not sure what to call it so I made that up. I’ve seen a couple of examples of people who will get links posted on other sites which are JavaScript text ads (like AdWords). But they have the keywords.

    I’m assuming because only search engines see the link (and the occasional person with javascript turned off) that this is deceptive and thus against webmaster guidelines. Correct? Should they be reported?

    -tim

  172. I don’t know if Matt will actually read all these comments, but I just want to focus in on the “paywall” subset of this discussion.

    Wouldn’t it be nice if Google had a formal way to differentiate between completely free content and behind-a-paywall content? This information could be reflected in the search results, i.e. with a “$” icon or whatever, so that it would be easy to skip links that would result in a login form appearing in place of the content. Actually, maybe distinguish between sites with no restrictions, sites with free registration, sites with paid registration…

    URLs could be labeled as such via an extension to the Sitemaps protocol, for instance.
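
    Something along these lines, say (the <access> element is purely hypothetical; nothing in the current Sitemaps protocol defines it):

    # Purely hypothetical sketch of a Sitemaps extension carrying an access label.
    # The <access> element is invented here; no search engine reads it today.
    def sitemap_entry(url, access):
        return ("<url>\n"
                "  <loc>" + url + "</loc>\n"
                "  <access>" + access + "</access>\n"   # e.g. "free", "registration", "paid"
                "</url>")

    print(sitemap_entry("http://www.example.com/article.html", "paid"))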

    This won’t solve anyone’s problems with overactive bots, but it would certainly enhance MY experience as a user. You could even add special operators to only search certain types of content, etc.

  173. Ok. That didn’t work. The above post where there is a link to ‘keywords’ didn’t show up in the post. Basically, they display ads, but have a noscript tag with a link pointing back to their site (with keywords as the anchor text). This has gotten this particular website the #1 position for at least 4 different extremely competitive search terms and probably without the sites that link to them even knowing.

    -tim

  174. Matt,
    If a searcher is looking for “ABC” and Google links to http://www.zzz.com/123.html, but the site’s best info for ABC is at http://www.zzz.com/ABC.html, so the site 302s the visitor, internally, to http://www.zzz.com/ABC.html, is that cloaking or illegal? All pages are freely available, googlebot has seen them all, and they are not new. It’s just that Google is not linking the visitor to the best info.

  175. Wouldn’t it be nice if Google had a formal way to differentiate between completely free content and behind-a-paywall content? This information could be reflected in the search results, i.e. with a “$” icon or whatever, so that it would be easy to skip links that would result in a login form appearing in place of the content. Actually, maybe distinguish between sites with no restrictions, sites with free registration, sites with paid registration…

    URLs could be labeled as such via an extension to the Sitemaps protocol, for instance.

    This won’t solve anyone’s problems with overactive bots, but it would certainly enhance MY experience as a user. You could even add special operators to only search certain types of content, etc.

    Exactly. As long as every site had a way to legitimately do this, and those who did it in an illegitimate manner were stopped from getting away with it, I’d be all for it, no matter what name it was given.

  176. G’s able to suss out subscription required sites for Google News. Those that don’t use the first click free program are identified with a “(subscription)” tag.

    If G can do it there…

  177. It isn’t cloaking!

    Who is this Philipp Lenssen and why does he want to talk about cloaking? Cloaking is technology that is available to a site owner if they choose to use it. Cloaking is also a website technology issue, and the very first question I need to ask myself is: does Philipp Lenssen understand website technology?

    (Testing Philipp’s Web-Tech Abilities)
    http://www.outer-court.com/
    http://www.outer-court.com/index.html
    http://www.outer-court.com/index.htm
    http://outer-court.com/index.html
    http://outer-court.com/index.htm
    http://outer-court.com/

    Conclusion:
    Philipp, you have failed a very basic web-tech test. Please demonstrate your understanding and implement basic web-tech issues before you try to tackle more advanced issues.

    WebmasterWorld:
    – 14 links from the Y and Moz directories (Anyone who doesn’t like what WMW is doing can go elsewhere. Many editors have listed the site based on the free original content.)

    Content protection is what is being used at WebmasterWorld. Try logging in and running a “site ripper” to find out how long it takes you to get banned. Now changes have been made so that even if you are banned, the site owner is still going to show you content. No rogues should ever be welcome, human or bot.

    Brett shouldn’t ever have to serve content to “system resources” thieves. If you are suspect, logging in should be required.

  178. Dave (Original)

    RE: “Had you looked at WMW’s profile feature you would notice that the email address disclosure is the MEMBERS choice, which puts the privacy where it belongs, in the hands of the members.”

    Then the choice has been made already. However, had you thought about it, it would be a lot cleaner if there wasn’t the choice to begin with; then this would be a non-issue. Members already have Private Messaging.

  179. Dave (Original)

    Matt Cutts Quote: “when a Google user clicks on a search result at Google, they should always see the same page that Googlebot saw.”

    I think that says it all.

  180. Showing an interstitial registration page to humans and the final content page to selected bots is NOT cloaking. A registration page is NOT content.

    What would you say if you, the human, just got a 401 status returned and a bare popup login box, instead of an HTML registration page? That would be even less friendly to humans, as it would give you no easy way to find out how to even register. At least a registration “page” has further instructions for you to follow – if you want to.
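
    (For contrast, here is roughly what the two kinds of response look like; a toy sketch, with the realm and wording invented:)

    # Toy sketch contrasting a bare 401 challenge (browser popup, no guidance)
    # with a 200 registration page that at least tells the human what to do.
    def bare_401():
        return 401, {"WWW-Authenticate": 'Basic realm="Members only"'}, ""

    def registration_page():
        return 200, {"Content-Type": "text/html"}, "<p>Please register (free) to read this thread.</p>"

    print(bare_401())
    print(registration_page())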

    A registration page is NOT “content”. Cloaking is showing different CONTENT to bots, in most cases keyword loaded spammy content, such as discussed at: http://www.mattcutts.com/blog/ramping-up-on-international-webspam/ – that is why cloaking describes a SPAM method.

    I cannot understand why people here cannot see the difference.

  181. The average Joe is not going to realise without digging around a site that the presentation to them is different to what a Googlebot will see.

    … they should always see the same page that Googlebot sees

  182. *** If WMW is serving up different pages to Google (who don’t have to log in) than *some* users see when clicking the SERP link (who do have to log in), then they are cloaking by Google’s definition. ***

    If the bot saw a page loaded with keyword spam, and nonsense phrases, and a user could NEVER see those pages then THAT would be cloaking.

    A registration page is not content. This is NOT cloaking.

  183. Cripes, you guys do go on, don’t you 😉

    It’s all a lot of hooey over the meaning of a word.

    Let me solve the problem for y’all
    Let’s have a few new words added to the language.

    The nicely specific:
    ‘dougiecloaking’
    ‘dannycloaking’
    .. or even ‘googlecloaking’ if we can get a definitive statement out of Matt.

    Of course, most of us old timers will still use the ever so vague and constantly redefined ‘cloaking’ to mean whatever we want it to, as in the immortal quote:

    ‘When I use a word,’ Humpty Dumpty said, in a rather scornful tone,’ it means just what I choose it to mean, neither more nor less.’

    For the record, ‘4eyescloaking’ would always depend on the intent, and I would still class WMW as having ‘4eyescloaked’ despite the ‘plausible deniability’ defense. Not that it matters, all SEO is an intent to deceive (sorry, of course I meant ‘all forms of SEO except DougieSEO’)

  184. yeah, it’s amazing that people cannot see nor understand the obvious. Good posts g1.

  185. To a search engine spider, content is any text that can be indexed. A registration page’s content is the part that says something to the effect of “Please register”.

  186. Just came across this example of an alternative entry page set up to keep out
    “the thousands of porn/meds/etc sites which have been hammering our site with fake referers”
    http://www.nafi.com.au/timbertalk/index.html?start=3251

    I feel that WMW and similar sites offering FREE access should not be considered as cloaking. Disposable e-mail addresses are easy to get!

    I do, however, think that if paid content appears in the SERPs, it must be indicated so I know not to click the link in advance.

    Sorry Doug, it appears some people just don’t get it, and god help them if their business is suddenly the target of what happens from the dark underbelly of the internet.

    When Google says you can’t install roadblocks to protect your site from being DoS’d or scraped, which aren’t cloaking, some of us will be faced with either going offline or trying to make a go of it without Google. The most we can hope for is that the other search engines are rational and can see past the cloaking rhetoric in those cases where it’s not cloaking at all.

  188. Bots should never see tens of thousands of “Please Register” pages on a site. Adding those URLs to the SERPs adds nothing to the search experience for users.

    For forums, I always exclude ALL user-action URLs that return a “You are not logged in” message from being indexed. Those include all URLs used for starting a new thread, replying to a post in a thread, editing a post, sending a PM, or editing a profile, and so on. Only content pages (threads and thread indexes) get indexed, even though you have to register and then log in to be able to post in the forum.

    Brett has gone one step further in that you need to be logged-in to read the forum if you meet some criteria that cause doubt as to whether you are a real user or a bot.

  189. Correction:

    … that cause doubt as to whether you are a real user or a rogue bot; Googlebot and other majors always get a free pass into the forum.

  191. “Dave (Original) Said,
    March 6, 2007 @ 3:49 pm

    Matt Cutts Quote: “when a Google user clicks on a search result at Google, they should always see the same page that Googlebot saw.”

    I think that says it all.”

    Well, I think that’s not possible, as Googlebot doesn’t have the same capabilities as a browser (full JS support, Flash, SVG, cookies, Accept-Language, etc.).

    What about geolocation? Would that be considered ‘bad cloaking’? Users from the US (and so Googlebot) may obtain different content than users from Spain.

    So, Googlebot and users may not see the same content and that’s not bad (actually that’s good).

  192. Dave (Original)

    RE: “If the bot saw a page loaded with keyword spam, and nonsense phrases, and a user could NEVER see those pages then THAT would be cloaking.”

    Nope, that is your interpretation of “cloaking”.

    Google only states;

    “Make pages for users, not for search engines. Don’t deceive your users or present different content to search engines than you display to users, which is commonly referred to as “cloaking.””

    And Matt has stated, in the context of cloaking;

    “when a Google user clicks on a search result at Google, they should always see the same page that Googlebot saw.”

    Which I believe says it all. There should be no need to play semantics with a patently clear statement like that.

  193. If for example what wmw and nyt are doing is considered “bad cloaking” then google.com should definitely be banned and taken out of the index.

    If I type in google.com I get redirected to my country’s Google site – but I do want google.com, not some other Google site. If Google thinks people are too stupid to find their country’s Google version, they can place a link at the top and ask, “Did you want to see google.xxx?”

  194. When I click on a Google search engine result, and the information Google had is not available to me without a paid subscription, I feel betrayed. By the webmaster AND by Google. Everyone I’ve whined to about it feels the same way.

    IMO, it’s as simple as that. This natural moral response to being intentionally frustrated is essentially the correct one, and I believe that as this situation plays itself out in the next decade, it will be shown to be the systematically correct one, too.

    Because practically speaking, we can’t have a situation on the web where more and more information is made available to be searched but not actually to be accessed in real-time by the rank-and-file searchers. Does anyone else see the problem with this? When 30% of searches actually end up denying the previewed information, it’s a very annoying problem. When 70% of searches do it, it’s a disaster. It won’t get that bad, you’re thinking? Who is going to prevent it if the New York Times does it without penalty?

    What I expect from Google is to let me know BEFORE I click a search result that it will be fruitless for me unless I register and/or pay (and I expect both requirements to be tracked separately), and I further expect Google to let me filter out either/or with a simple checkbox. I’m not getting any of that from Google right now, but I’m quite sure that one day I will, because without it the web will be broken before too much longer.

    I can already tell you that using Google today versus two years ago is a substantially different experience in terms of frustration factor. If it keeps going like this I will switch to whichever search engine is most brutal about exposing webmasters’ practices to the light of day in the SERP listings themselves, so that if I don’t like what Google is telling me about your site, no clicky for you.

  195. One more thing that has to be said. When I am taken from a Google results page to a registration page instead of to the previewed information, and that registration requires a fee, AND I am shown ads on that useless registration page, I feel like flying into a white-hot rage.

    If you lure 100 people to see an ad, through Google, in the full knowledge that only 10 of them (very optimistically) will be willing to register and thus 90 of them will never see the information you lured them here with, then you have just profited from luring 90 viewers to your ads in the full knowledge that they will not find what they are looking for — and how is this NOT gaming the system?

  196. I just discovered that Google ostensibly allows filtering of sites by whether the content has a free licence in the Advanced Search, but it’s obviously not supported by robust enough data, and as a result it’s so broken that it will not even filter out the most famous examples. For example, try a ‘free content only’ advanced search on Paul Krugman…

  197. Different (human) users may legitimately get different content for the same URL targeted to their Accept-Language, User-Agent (eg for small screens), bandwidth, cookies, etc. Some of that is called “content negotiation” and is part of the HTTP spec.

    So as long as Google’s/SE’s bot gets to see what *a* normal human would (not ALL possible settings for all possible visitors, but I guess not an extreme outlier either) then an over-literal interpretation of Matt’s words is dogmatic and unhelpful to you, me, SEs and visitors alike.

    Let’s avoid deception but embrace some pragmatism here, please.

    (Matt, tell me if G’s giving up on pragmatism any time soon!)

    Rgds

    Damon
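
    For anyone unfamiliar with it, here is a minimal sketch of the Accept-Language side of the HTTP content negotiation Damon mentions: the server picks the best available translation for each client and advertises that via a Vary header. The available-language list is made up purely for illustration.

    # Sketch: choose a page language from the client's Accept-Language header.
    # AVAILABLE is hypothetical; a real site would list its own translations.
    AVAILABLE = ["en", "es", "de"]

    def negotiate_language(accept_language: str, default: str = "en") -> str:
        # Header looks like: "es-ES,es;q=0.9,en;q=0.5"
        choices = []
        for part in accept_language.split(","):
            lang, _, params = part.strip().partition(";")
            params = params.strip()
            q = 1.0
            if params.startswith("q="):
                try:
                    q = float(params[2:])
                except ValueError:
                    pass
            choices.append((q, lang.split("-")[0].lower()))
        for _, lang in sorted(choices, reverse=True):
            if lang in AVAILABLE:
                return lang  # the response should also send "Vary: Accept-Language"
        return default

    # negotiate_language("es-ES,es;q=0.9,en;q=0.5")  ->  "es"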

  198. If the user is sent to a page that is on topic with what the googlebot saw, then is this an issue?

    What should be stamped on is a googlebot seeing and indexing a “soft” keyword, only for the user to be directed to Viagra or XXX.

  199. A pragmatic course of action only remains so as long as the circumstances around it don’t change too much. With the explosive growth of dead-end search results the last few years, what is practicable and pragmatic is changing every day. Can anyone disagree that Google should be anticipating this trend and adjusting their policies to compensate, instead of letting it just happen to them (and to us)? Google’s approach-to-date did not and will not exist in a vacuum, and thinking so is a great way to lose that #1 plum search position.

  200. So as long as Google’s/SE’s bot gets to see what *a* normal human would (not ALL possible settings for all possible visitors, but I guess not an extreme outlier either)

    And you’re suggesting that that would include a registered user with a cookie that will identify them and sign them in automatically when that cookie is recognized?

    As I see it, when any user-agent initially attempts to begin a session on a given server, it can be classified into one of three groups (this is just the first request of the session — obviously, one can log in and break things down into more than three groups after this).

    1. Banned: these users, when detected, don’t get anything off the server with the possible exception of an error message. In the case of a compliant spider, you don’t need to detect them; you just block them via the robots exclusion protocol.

    2. General Public: these users will have access to anything that is not protected. The server doesn’t check to make sure a given user-agent is a member of this group; it checks to see if the user-agent isn’t a member of the group.

    3. Members/Subscribers: the user is recognized by the server (normally via a cookie) and is given access to content that the general public can’t reach.

    If you detect the spider and treat it like it’s a member of group 3, it’s going to treat private content as if it’s public, and it will report back that that information is accessible to the general public.

    That’s deceptive. You’re specifically checking for the spider in order to put it into a group other than the general public, when it’s designed to be part of the general public so that it can report back to that group about what content they can access.

    If members of the general public are required to take an extra step to identify themselves in order to gain access to certain information, that’s something the spider should know, and treating it like a member serves to keep that information from it.
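
    For illustration, a minimal sketch of that three-way classification, assuming a hypothetical banned-IP set and session-cookie name; the point is that a search engine spider should fall through to the General Public branch rather than being special-cased into group 3.

    # Sketch only: classify the first request of a session into the three
    # groups described above. BANNED_IPS and "member_session" are
    # hypothetical placeholders, not anyone's production setup.
    from enum import Enum

    class Visitor(Enum):
        BANNED = 1
        GENERAL_PUBLIC = 2
        MEMBER = 3

    BANNED_IPS = {"203.0.113.9"}  # example/placeholder address

    def classify(remote_ip: str, cookies: dict) -> Visitor:
        if remote_ip in BANNED_IPS:
            return Visitor.BANNED
        if "member_session" in cookies:  # recognized returning member
            return Visitor.MEMBER
        # Everyone else -- including a search engine spider -- is treated as
        # the general public, so the index reflects what the public can see.
        return Visitor.GENERAL_PUBLIC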

  201. Registered Users from any ISP and Ordinary Visitors from Trusted ISPs and Verified Pre-Registered Bots get to see the content.

    Untrusted Bots from anywhere and all Visitors from Untrusted ISPs are asked to Register. If they really are a human they have a chance to access the content. If they are a bot then their unwanted access attempt has been thwarted.

    I still don’t see any deception being pulled on the search-engine bot.

  202. The real issue here isn’t whether a LOGIN is cloaked or not, it’s why some ISPs don’t handle AUP complaints of abuse. This causes some webmasters to resort to slapping CAPTCHAs or LOGINs in front of surfers to stem the abuse.

    If Google would put resources behind those ABUSE issues regarding scrapers and ISPs that permit them to scrape, then perhaps this issue would be a moot point.

    More importantly, if Google didn’t permit scrapers in their SERPs or the AdSense program, then again, that costs them money, stockholders complain, screw the webmasters…

  203. Agreed, but it comes back to intent again. While some webmasters are protecting themselves from scrapers, others can use this technique to monetize content and/or collect personal information without the search engine knowing.

  204. Dave (Original)

    The “real issue” is simply what Matt stated

    “when a Google user clicks on a search result at Google, they should always see the same page that Googlebot saw.”

    Debating semantics over what is cloaking and what isn’t is non-productive.

    Attempting to hold Google responsible for scrapers etc is also non-productive. Like it or not, Google’s responsibility is to its shareholders. This is looked after by looking after its searchers, NOT webmasters’ individual wants.

    There are millions of sites that protect their content without treating googlebot differently to humans.

  205. Dave (Original)

    WMW has every right to protect their content, as does any site out there. I think we would agree with that. However, it appears that some believe Google doesn’t retain the right to decide what is acceptable and what is not for inclusion in their Index. In this case that is;

    “when a Google user clicks on a search result at Google, they should always see the same page that Googlebot saw.”

    We all get our pages scraped at sometime. However, no use killing flies with sledge-hammers. Also, I have yet to see a scraper site out-rank the original. I believe that is by design of Google.

    The fallacy of protecting members’ emails simply doesn’t wash with me. There are ways to deal with that without “cloaking”.

    My site pages have been scraped, copied, stolen etc more times than I can imagine, and what I know of is likely only the tip of the iceberg! However, UNTIL the benefit of Google sending me tens of thousands of free visitors is outweighed by scrapers, I have no intention of biting off my nose to spite my face, shooting myself in the foot, or using a cure worse than the disease.

  206. I think that what has been lost in this conversation is who the search engine is serving. It is not serving webmasters, it is serving *searchers*, and needs to focus on the searcher experience. Who cares what constitutes “cloaking” or what is deceptive. What matters is what is going to result in the best experience for the searcher, and the best possible search results – period.

    Now, a previous commenter already said what’s needed: Google should display the content snippet, and add a “Paid content” or “registration required” notation if applicable.

    So, I’m going to go get some coffee, and Google engineers, you go ahead and finish that up….kay? 🙂

    Gradiva

  207. Please do not add a “Paid content” or “registration required” notation in the search results. If you do that it will encourage more cloaking and the Web will be filled with login screens. It would make hours of extra work for visitors to keep track of all the logins/passwords.

    Content that is hidden behind login screens should not be indexed by search engines. Those kinds of sites need a business model that does not revolve around cloaking.

  208. tips: if anything, the opposite would occur.

    Think of this from a user standpoint…if you’re looking for something on big G, and there’s a little icon beside a result that means “registration required”, are you going to click on that listing? Probably not.

    *** IDEA BASED ON SPECULATION TO FOLLOW ***

    And if users were able to filter subscription/registration-based results via personalized search, thus eliminating those results from their queries, would any half-decent SEO or webmaster type want to hide content behind a login screen? Of course not. They’d lose a percentage of potential customers.

    Besides, registration and cloaking are two totally different animals.

    Like I said earlier though, no matter which path big G takes on stuff like this, they’re totally and completely screwed.

    Inaction will piss some people off.
    Notation will piss some people off.
    Filtering out the results entirely will piss some people off.

    So what do they do? They really can’t win.

  209. Registration screens can’t exist in Google search results without cloaking (or some kind of favoritism/collaboration).

    Currently Google users do not have to commonly deal with registration screens from the search engine results. It is hard to put up a registration screen without blocking Googlebot unless you are cloaking.

    Some people don’t use personalized search and don’t want to ever use personalized search. Personalized search is a privacy violation. One company (Google in this case) should not be collecting so much data on its users. It is one thing to catalog “the world’s information” (not evil). It is another thing to catalog so much data on users (evil). Creating annoyances in the search results that would require the use of “personalized search” to bypass would be additional evil on Google’s part.

    You mentioned that people would be unlikely to click on results that said “registration required”. I don’t agree. If you are looking for something and the title and text snippet agree with what you are looking for, you might click on it anyway, even though you have to go through the frustration of registering for a new site and keeping track of a new password.

    Encountering registration screens from the search results is a bad thing.

  210. The issue is bigger but more basic than cloaking. Google has announced in the past that its goal is to index the world’s information. The part that needs clarification is, did they mean:
    1) the World’s Free and completely open Information
    2) the World’s Information that is free or requires only a free login
    3) the World’s Information regardless of what hoops you have to jump through to see it (pay for access, sell momma for access, trade soul for access…)

    When Google makes that clear, then Matt can initiate a policy and we can all get back to work.

  211. I don’t think the issue is fully covered without taking a look at what Google Scholar is doing, per this forum post at SEW, quoting an email:

    http://forums.searchenginewatch.com/showthread.php?t=16563

    ====================================

    – Publishers allow Google crawlers to access full text content (based on IP-address-based authentication, similar to institutional subscriptions most publishers already support).
    – Google search results point back to full text links. No cached link is displayed for these results.
    – When a user clicks on such a link, they are delivered to the publisher’s site which should display at least a reasonable summary of the paper in question. For scholarly content, this usually means an abstract.
    =====================================

    So that raises questions:

    How does an abstract on white paper type content and/or scholarly articles differ from what amounts to the equivalent of a meta description? How do they compare, and in what way are they used?

    If Google considers scholarly papers to be valuable content, enough to make special provision and provide specialized Scholar search, in what way are scholarly and other, non-scholarly sites, different or of equal value?

  212. It is hard to put up a registration screen without blocking Googlebot unless you are cloaking.

    That’s incorrect.

    It’s actually quite trivial to build a list of who’s allowed full access, restricted access, or banned entirely based on IP or reverse DNS.
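
    For what it’s worth, a minimal sketch of the reverse-DNS check being described: reverse-resolve the IP, check the hostname, then forward-confirm that the hostname maps back to the same IP. The hostname suffixes are the commonly cited Google ones; the rest is illustrative, not anyone’s production code.

    # Sketch: verify that a visitor claiming to be Googlebot really comes
    # from a Google-controlled host, using reverse + forward DNS lookups.
    import socket

    def is_verified_googlebot(ip: str) -> bool:
        try:
            host = socket.gethostbyaddr(ip)[0]       # reverse lookup
        except OSError:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            forward_ip = socket.gethostbyname(host)  # forward-confirm
        except OSError:
            return False
        return forward_ip == ip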

  213. “Inaction will piss some people off.
    Notation will piss some people off.
    Filtering out the results entirely will piss some people off.”

    Notice that the first two could piss off Google’s actual users, whereas the third option only pisses off people who are trying to profit from Google’s users.

  214. Er … actually both of the last two options only piss off people trying to manipulate Google’s users, come to think of it.

  215. Y’know what? Someone should code up a browser that identifies itself as ‘Googlebot’ and thus lays open all of these ‘registration for all but Google’ pages to the public in a single stroke. It would be a well-deserved lesson for those attempting to control which unregistered user agents can see their content for the purpose of luring people with information they aren’t actually giving them. (Because face it, there is no other purpose for this.)

  216. Dave (Original)

    RE: “That’s incorrect.

    It’s actually quite trivial to build a list of who’s allowed full access, restricted access, or banned entirely based on IP or reverse DNS.”
    ======================================

    IF full access is granted to googlebot and restricted/no access to some, or all humans, then that flies in the face of;

    “when a Google user clicks on a search result at Google, they should always see the same page that Googlebot saw”

    Or, put simply, cloaking.

  217. Paul Laroquod: exactly.

  218. It’s actually quite trivial to build a list of who’s allowed full access, restricted access, or banned entirely based on IP or reverse DNS.

    How does that differ from cloaking? (showing Googlebot different content than general visitors)

  219. Paul Laroquod: try Firefox; plenty of extensions do just that, which is why the real cloaking pros do a reverse DNS lookup on the IP rather than the user agent.

  220. “which is why the real cloaking pros do a reverse DNS lookup on the IP rather than the user agent”

    Dang.

    Why can’t Google just run a bot with a different user agent, compare the results, and if they’re different, just list the search result with an unobtrusive “Warning: The info on this page may not match what was indexed by Google.” It’s pretty neutral language which doesn’t make any accusations toward the webmaster, but it allows people who are only interested in seeing exactly what Google saw to make far better use of their time on Google.
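
    As a rough sketch of the comparison being suggested: fetch the same URL with two different User-Agent strings and flag a mismatch. This is illustrative only; pages that vary for legitimate reasons (ads, timestamps) would need smarter comparison, and as the next comments point out, sites keying off IP or reverse DNS rather than the User-Agent would pass this test anyway.

    # Sketch: fetch one URL twice with different User-Agent headers and
    # report whether the responses differ. Purely illustrative.
    import urllib.request

    def fetch(url: str, user_agent: str) -> bytes:
        req = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read()

    def looks_different(url: str) -> bool:
        as_browser = fetch(url, "Mozilla/5.0 (Windows NT 6.1)")
        as_bot = fetch(url, "Googlebot/2.1 (+http://www.google.com/bot.html)")
        return as_browser != as_bot  # naive: dynamic pages differ anyway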

  221. User-Agent is trivial to forge and is entirely untrustworthy.

    Matt himself, AFAIK on this very site, suggested a reverse DNS lookup to verify that an incoming bot really is Googlebot.

    To be quite clear: I HAVE NO REGISTRATION OR LOGIN ON ANY OF MY SITES (except to stop people *uploading* illegal material). I do not charge money for access. I *do* have a problem with bots and their CPU/bandwidth demands and have had this problem since at least before G existed. I show cut-down pages to reduce these stresses and increase performance when visited by bots, by a user for the first time, and when the system is under stress. I also use HTTP-standard content negotiation to tune the page to the user. Arguably there is no *standard* page that gets shown to all users.

    I *strongly* object to statements above that the only reason to do this is to be deceptive and/or screw money out of innocent users.

    Please stop insulting people whose problems, antagonists and technology you clearly do not understand.

    Rgds

    Damon

  222. Damon, I am not accusing you of anything personally here. I don’t even know which site you’re running. I don’t need to understand anything about your situation to know that more information is the best thing for Google’s users. And I definitely don’t understand why it makes you so angry that people want to know if a site is going to require them to pay before they click on it.

  223. WMW (the site that this discussion kicked off discussing) does not require you to pay before you get access; you only have to validate a working email address and associate it with a chosen user name.

  224. Hi,

    What makes me angry is the assumptions made by the people who have not had their servers/bandwidth repeatedly and entirely wiped out by a bot over many years. If you were one of us you’d realise that some drastic measures are needed to enable anyone to see any pages AT ALL on our servers.

    One of the bad bots in question was G’s as it happens.

    It probably wasted 1000+ hours of my time converting my site from simple Apache flat file to something that could fend off the bots relatively gently. It has a continuing cost in bandwidth to me now. This is not hypothesising or grandstanding.

    It would be great if people could be warned in the SERPs if payment is required, but WW, which is the subject of this thread apparently, only requires payment for the Supporter’s area AFAIK, and also AFAIK, that doesn’t get shown to G’s bot either. If it does, then Brett should fix it, but that doesn’t seem to be the substance of this discussion.

    So, to be clear, neither my site nor WW is showing G’s bot ANYTHING that you’d have to pay money to see. And in WW’s case, at *worst* you have to register if your ISP pollutes the Net. I screen bad IPs too, to protect my site, though slightly differently to Brett it seems.

    Rgds

    Damon

  225. Dave (Original)

    Pay-to-view is not the issue in this case, it’s simply not always seeing what googlebot saw when you click on the SERP result. That is not good for Google or its users.

  226. If there is just one site on the web that has the exact information that I am looking for, but I need to register to see the information, Google would do me, the searcher, a dis-service by failing to let me know that the page even exists.

    Once I at least know that it does exist, it is then my choice as to whether to register and access the site, or not.

  227. IF full access is granted to googlebot and restricted/no access to some, or all humans, then that flies in the face of;

    “when a Google user clicks on a search result at Google, they should always see the same page that Googlebot saw”

    Or, put simply, cloaking.

    I don’t code my site specifically for Googlebot, never have, never will, and last time I checked they still haven’t completed their conquest of the internet.

    I code my site for spiders in general, users in general and to reasonably filter or block abuse with some levels of human authentication.

    If that results in one SE dropping my site, so be it, as I have the right to protect myself regardless of what kind of wacky little label is cast upon that practice.

    EOM

  228. Dave (Original)

    Yes, but it’s not about you, it’s about *all* Google users. Besides, it’s silly to assume that any page that requires registration would have *exactly* what someone seeks. Googlebot certainly cannot make that assumption and allow this type of “cloaking” to perpetuate.

    What will most likely happen is, Google users will get fed-up with Google, and vote with their feet, after clicking a SERP result and then having to register to see what googlebot saw only to find it doesn’t meet their needs. Even if it does, it’s the experience that Google wants for its users.

    This is why Matt has stated many times over that when a Google user clicks a SERP result, they should always see what googlebot saw. Nothing about having to register first, or jump through any other hoops. It’s simply a case of common sense.

  229. Dave (Original)

    “it’s the experience that Google wants for its users”

    Should read

    it’s not the experience that Google wants for its users.

    This is why Matt has stated many times over that when a Google user clicks a SERP result, they should always see what googlebot saw. Nothing about having to register first, or jump through any other hoops. It’s simply a case of common sense.

    I’d rather know the sum total of everything that’s REASONABLY available on the internet that’s relevant to my query and make my own decision to answer that captcha, login or even pay.

    Not having it shown just because it’s behind a login makes no common sense at all, unless the information is restricted and available by INVITATION ONLY meaning you have no way to access it no matter what you do. As long as I can exercise what ever is necessary to gain access to the actual information, I’d like to know it’s there and have that choice. Don’t need other people making those decisions for me, or for my visitors for that matter.

    Perhaps you missed my Yellow Pages analogy above, as the YP doesn’t discard listings because you have to pay a cover charge, be screened by a bouncer, get buzzed in, or be forced to (gasp) wear a shirt or shoes to get inside, or anything else, as long as it’s legal.

    Sounds like Google’s trying to mandate more stipulations than even the offline world to get your service listed, which is a bit heavy-handed IMO. Especially when Google is more or less responsible for providing the AdSense incentives that caused the scraper escalation to the point where people resort to CAPTCHAs and LOGINs.

    Google can’t have it both ways. They can’t entice internet bandits to run amok digging AdSense gold and then forbid webmasters to use locks where appropriate to keep the crooks out. They then can’t have the audacity to label it cloaking which is what the crooks also use to leverage your information to spam the SEs.

    For some reason I think the cycle of pain Google is causing is being overlooked: legitimate webmasters who are suffering because of Google shouldn’t also be punished by Google.

    If you think about it, you know I’m right, which is why I’m firmly behind what Brett and many others are doing to fight this Google funded scourge.

    Don’t you see the entire irony of this thread?

    It’s completely hypocritical from the start.

  231. Dave (Original)

    You are missing the point of the BIGGER picture by focusing only on your “I” posts. IF Google were to allow anyone and everyone to serve up different pages to googlebot than to humans (perpetuate WMW methods) it would cause it to fall from grace. You say you are ok with it ‘as is’, but I’m certain even you wouldn’t be if it became the norm.

    Just as Webmasters retain the right to protect their content in any way they see fit, Google retains that same right. Millions of Webmasters protect their content without biting the hand that feeds them.

    I’m astounded that anyone would even consider this ok!

    In regards to Webmasters “that are suffering because of Google”. Sorry, but that is a crock IMO. 99% of Webmasters prosper “because of Google”. Let’s not have the tail wag the dog.

    RE: “If you think about it, you know I’m right, which is why I’m firmly behind what Brett and many others are doing to fight this Google funded scourge.”

    If Google is as bad as you insinuate why don’t you block googlebot? Seems to me you and “many others” are doing the complete opposite, that is, allowing Google FULL access while some humans are being restricted. Now that IS “irony” and “completely hypocritical”.

  232. If you want to twist my words about the problems that the free-for-all AdSense policies cause with all the scrapers, then enjoy the game.

    FWIW, some of us were prospering long before Google came along with just the Yahoo directory, then Alta Vista, Inktomi and so forth so don’t give Google so much credit regarding all our prospering. As a matter of fact, if Google’s office sank into the quagmire tomorrow we’d probably still prosper when Yahoo and MSN picked up the slack.

    You can have the last word while I go cloak something for no other reason than this thread has driven me to it…

  233. Dave (Original)

    Bad Google! They create an advertising avenue that allows a win for advertisers and those who *choose* to use AdSense. Then some scum start to scrape site content and abuse AdSense, and you blame it all on Google and not the *real* perpetrators. Bad Google! Let’s also blame those in medicine who discover drugs that save lives and relieve pain; after all, they are the cause of the illegal drug trade.

    Have you ever considered the VERY likely fact that IF another SE takes the place of Google, they would become *the* target?

    BTW, I quoted your words, not “twisted” them 🙂

  234. “If there is just one site on the web that has the exact information that I am looking for, but I need to register to see the information, Google would do me, the searcher, a dis-service by failing to let me know that the page even exists.”

    Since advanced Google search filters are applied on an opt-in basis, this is a straw-man scenario that need never happen.

  235. This debate has gone on for more than a week now, but I think we’re all waiting for a final word from Matt, regarding both this particular case and the general idea of detecting a bot to serve it content instead of a login page.

  236. I think Matt gave his final word on cloaking right here: http://www.mattcutts.com/blog/a-quick-word-about-cloaking/

    I was thinking the same thing, that it was odd that he hasn’t jumped in and mediated some of the discussion in the comments here, but then I thought about it again.

    In his last paragraph Matt says, “…Google searchers get the identical page to what Googlebot saw.” Which to me is pretty black and white, without a lot of shades in between. He says “identical”, not:

    ~ a sign up page for free or otherwise
    ~ as long as the NEXT page is identical

    Most of this comment thread has been on the semantics of terms in the industry, which has its place, but bottom line is this: When a user of Google’s search clicks on a result they should expect to see exactly what that link is, nothing else, no other steps, no matter what. Any other result falls into that “sneaky” category of redirects, which is a bad thing.

    What you do with the visitor after that first page is up to you; maybe have the 2nd click need an email… or the 5th, which is fine, as Google isn’t recommending the whole site for your search result, just the one page they linked you to.

    I’d also suspect that since this post was made they will be ramping up enforcement of it. So I’d heed the implied warning, or at the least be prepared to try to rank for “subscription required” or whatever is on your sign-up page, or at worst be de-indexed. Personally I’d rather see them just replace the contents of any URL that redirects with the page it redirects to than a full-out ban. Imagine the horror of waking up one day to find your site is 99.99% supplemental with all caches being of the sign-up page.

  237. I hope Matt decides to revisit this topic soon, since it’s clear there’s a lot of disagreement.

    FWIW, our current solution to the registration problem is we base it on the referrer field. If the referrer value is from our site, we display the registration page. Otherwise, we provide unfettered access.

    That way, if you found us from the SERPs, you aren’t disappointed. But if you’re using our site and clicking around, we nag you into registering.

    Of course you can simply reload the pages to evade the registration, but it seems to work fairly well in practice (~10% registration rate).

    I would really like some more clarity on the registration issue though (regardless of whether we call it cloaking or not). I like the idea of showing whether a subscription is required for the data (that could potentially be specified in the meta robots tag).
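
    A minimal sketch of the referrer-based approach mla describes, with hypothetical names; as noted in the next comment it is trivial to circumvent, since it only shifts the nag onto visitors arriving via internal links.

    # Sketch of the referrer-based registration nag described above:
    # internal clicks see the registration page, external/SERP visitors
    # (and visitors sending no referrer at all) get the content directly.
    from typing import Optional
    from urllib.parse import urlparse

    OWN_DOMAIN = "example.com"  # hypothetical; substitute the real site

    def show_registration_nag(referrer: Optional[str], logged_in: bool) -> bool:
        if logged_in or not referrer:
            return False
        return urlparse(referrer).hostname == OWN_DOMAIN

    # show_registration_nag("http://example.com/forum/", False)        -> True
    # show_registration_nag("http://www.google.com/search?q=x", False) -> False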

  238. mla: the problem with your solution by itself is that there is nothing stopping anyone from programming a bot with a fake UA and “clicking” one result from a SERP, “going back”, waiting a few seconds, “clicking” the next result, etc.

    You could theoretically end up with your whole site scraped (if that bothered you…it doesn’t bug me personally but I can see why it would bug others.)

    The other issue is that it doesn’t address paid content that search engine spiders can access but users cannot without registration (e.g. New York Times.)

    Don’t get me wrong…it’s not a bad idea. I just don’t think it provides the full solution and it wouldn’t provide any real help to anyone that’s being attacked by bots using ISPs that assign dynamic IP addresses.

  239. Dave (Original)

    RE: “In his last paragraph Matt says, “…Google searchers get the identical page to what Googlebot saw.” Which to me is pretty black and white, without a lot of shades in between. He says “identical”, not:

    ~ a sign up page for free or otherwise
    ~ as long as the NEXT page is identical”
    =========================================

    Exactly! I think the semantics in most cases is an attempt to keep the masses confused so this practice can continue.

    In my mind it’s patently clear that Google MUST stop any practice BEFORE it becomes perpetuated. Such a practice being perpetuated (right or wrong) would be the downfall of Google if they allow it.

  240. Such a practice being perpetuated (right or wrong) would be the downfall of Google if they allow it.

    But since you see this as black and white, it would appear that in the cases of the NYT and NPR, Google are allowing it.

    That’s why I’m hoping Matt will comment about it.

  241. Dave (Original)

    RE: “But since you see this as black and white, it would appear that in the cases of the NYT and NPR, Google are allowing it.”
    ==========================================

    I don’t know, but IF they are it just goes to show that it IS being perpetuated already with the hollow defense of “but they do it…”.

  242. Multi-Worded Adam: yes, I know my solution doesn’t solve many of the issues.

    I simply wanted to encourage subscriptions but still be well indexed and give users the content they found on the SERPs (since the SE policy on that is still very unclear to me and I wanted to play it safe).

    But yes, it’s trivial to circumvent.

  243. Content-negotiation to suit the client *enhances* accessibility if done right.

    An SE’s bot is merely one client amongst many, and one that usually fails to set ‘Accept-Language’ for example…

    Rgds

    Damon

  244. Hello Matt,

    I report spam almost daily and include very good descriptions to help find and kill it, but I see sites I reported multiple times over time still in “pole position” for unrelated terms driven by their KW spam.

    The spam is not cloaked, it’s in a color so that it’s barely visible, but it’s 100% non-human writing, just keywords over and over inserted into every site footer automatically.

    Keyword1 Keyword2 Keyword3 Keyword1 Keyword2 Keyword3
    Keyword1 Keyword2 Keyword3 Keyword1 Keyword2 Keyword3
    Keyword1 Keyword2 Keyword3 Keyword1 Keyword2 Keyword3
    Keyword1 Keyword2 Keyword3 Keyword1 Keyword2 Keyword3
    …. (more than 5 pages of this with different KWs in small type) ….

    Best regards,
    Sergey – not Brin 😉

  245. Have WMW changed their tune, or are they just being a little more subtle?

    I tried to get to this page (linked on their home page)
    http://www.webmasterworld.com/domain_names/3287198.htm

    And got told:
    >

    Many other WMW links work, so they may be trying to hide beneath Google’s radar.

  246. In the area of cloaking & redirects, one thing I have always wondered about & never got a sensible answer on is where domain aliases sit in this area.

    If you have a well-PageRanked page on domain abc.com and want to move it to 123.co.uk on the same server: if you move the pages to 123.co.uk and then set abc.com as an alias, how will this affect the PageRank, and will it be transferred to 123.co.uk?
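
    Not speaking for Google, but the usual advice for a move like that is a per-URL 301 (permanent) redirect from abc.com to 123.co.uk rather than a bare alias, so the engines learn each page’s new address. A minimal sketch, assuming a tiny WSGI handler answering requests for abc.com (the hostnames are just the ones from the question):

    # Sketch: answer requests arriving for abc.com with a 301 pointing at
    # the same path on 123.co.uk, so each old URL maps to its new home.
    def redirect_app(environ, start_response):
        new_url = "http://123.co.uk" + environ.get("PATH_INFO", "/")
        if environ.get("QUERY_STRING"):
            new_url += "?" + environ["QUERY_STRING"]
        start_response("301 Moved Permanently", [("Location", new_url)])
        return [b""]

    # Serve with any WSGI server, e.g.:
    # from wsgiref.simple_server import make_server
    # make_server("", 8080, redirect_app).serve_forever()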

  247. Another one to add to the list of definite media company cloakers:

    http://www.Unison.ie

    The website of the Irish Independent. They simply use the UserAgent. Nothing too technical about their cloaking. Easy to identify.

  248. In Search results: “Do Shareholders Have the Clout to Rein in Excessive Executive Pay …
    Warren Buffett’s latest attack on excessive executive compensation is another chapter in the long-running saga about how much to pay top business leaders, …”

    page: knowledge.wharton.upenn.edu/article.cfm?articleid=780

    But that isn’t the page the site gives web browsers who are instead forwarded to:

    http://knowledge.wharton.upenn.edu/signup.cfm?CFID=6066164&CFTOKEN=76742544

    As I stated before, I think having these results might well be fine, but options should be provided to users to exclude these sites by default (with an option to also show links that match but require additional hoops to be jumped through; hopefully that could be split into showing free-registration sites and showing sites where payment is required to view results). Also over time this should have the option for me to tell Google which sites I want included by default (say I have subscriptions to 3 sites already…). When I suggested this years ago I figured it might be a bit early to expect. I would have expected it by now.

  249. And so the war begins…

    http://www.washingtonpost.com/wp-dyn/content/article/2007/04/06/AR2007040601967.html?hpid=sec-tech

    If Google is not already planning to have ‘notification of paid registration requirement’ in its arsenal, I think it’s going to need it before too long.

  250. Here’s another one for you to check out Matt, http://www.retailwire.com

    http://www.google.com/search?num=100&q=site:retailwire.com

    Every page I clicked sends you to another one of those FREE registration/name-harvesting pages.

  251. Matt, I really appreciate your blog and the fact that you have taken the time to start addressing this issue… I am wondering if you are willing/able to comment on the statement made by Adam Lasnik:

    “On a happier note, my colleagues and I are working on an arrangement which I think you’ll be pleased with… balancing many Webmasters’ interest in requiring community membership or signin to content-rich pages while still showing content in Google’s search results. Stay tuned 🙂 (we’ll make an announcement in the Webmaster Central blog)”.

    In reality a solution such as this would ultimately end this debate! As a software developer for a national newspaper, this type of functionality is what my employer is demanding… it would allow me to legitimately encourage Googlebot to crawl my sites’ login-protected content without the fear of losing my job over one or more of my sites being blacklisted…

    This statement was made almost 5 months ago, and there are droves of people anxiously awaiting any snippet of information that might divulge the status of this type of mechanism.

    Thanks
    -Jake

  252. Hmm, interesting. I think it is a bad idea for Google to start picking on sites like this which provide massive loads of quality info when Google hasn’t yet cleansed their engine of REAL cloaking. This is not REAL cloaking. This is just identifying the googlebot and letting it crawl the site while making users log in. Is there anything wrong with this? No. The guy has created a huge forum, and if he wants users to log in to see it he should be able to request that, and Google should not force him to provide free content or they won’t index it.

    Work on getting rid of spam and real cloakers, and then when you’ve got that under control maybe you can think of other things. This is NOT a problem; it does not put your index in a bad light, it does not hurt anything. I will actually click a WMW listing over some others just because I know they usually have good info.

    So in conclusion, don’t do it. I know you probably won’t read this, or if you do you will ignore it, but I honestly think this is a stupid waste of time. And having said that, I have no affiliation with WMW; I am a registered user there and have enjoyed a lot of good discussions there. I think removing them from the index would be very foolish. This isn’t like the BMW cloaking we saw a while back.

    For anyone who read all that thanks! For those who didn’t, don’t comment.

  253. He is a strange guy, but the forum has a lot of good content. The members are generally polite and informative with their posts.

    But that has nothing to do with the debate of whether or not the cloaking the forum does should be allowed or not.

    I personally think it should be allowed. I have never used this method; I thought about it, but so far all my articles are free for all to read.
    My thing is that this is not the original idea of cloaking. The idea of cloaking is to trick google by showing them a keyword-rich page (generally a garbled mess to humans) while real people see something different.

    This is not the case at all with WebmasterWorld. No, instead he is just making them log in/register to see the content. He is not trying to trick Google by showing it another page or anything. He just wants money.

    Of course I can see where this would be very annoying to some people if they clicked a result and had to register (pay) or log in to see the site. So I can understand that too.

  254. Well I can see your point. If everybody did it… the point is they don’t. Just one. The question is: should one site be allowed to do something just because it is singular?

    Matt, I’ve got a question if you’ve got time. A guy over at seochat said he had a problem similar to this, and to combat it he offered each user 5 free clicks. So users could see 5 clicks deep and then had to log in or register to view any more.

    What do you think of that alternative to this problem?

  255. Very well said – from a highly respected name in SEO!
    Thanks for all of your great articles!

    Best regards,
    Trond

  256. experts-exchange.com is doing this now, too.

    Googled for “outlook delay in typing”. This result appears:

    delay when typing in Outlook 2007
    Question: I have a significant delay when typing emails in Outlook 2007. Sometimes I will type 3 or 4 words before they will appear on the screen. …
    http://www.experts-exchange.com/Software/Internet_Email/Email/Q_22628261.html – 67k – Cached – Similar pages

    Clicking the link itself results in all the text being fuzzed out and the only content is the question. Clicking the “cached” link results in seeing the responses. They’re sending Google different information than visitors.

  257. Shawn – very interesting find. If you view the source code of that page, you can see that in addition to making the text “fuzzy,” they’ve garbled it. But they’ve only used simple ROT-13 on the text, so you can easily convert

    >> V qba’g unir n svk sbe vg ohg lrf V unir frra gung gbb.

    to

    >> I don’t have a fix for it but yes I have seen that too.

    Or, it might be even faster to ditch Outlook 2007 😉
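
    Python’s standard library even ships a codec for it, if anyone wants to un-garble those pages in bulk; a quick sketch:

    # Quick sketch: undo the ROT-13 obfuscation described above.
    import codecs

    garbled = "V qba'g unir n svk sbe vg ohg lrf V unir frra gung gbb."
    print(codecs.decode(garbled, "rot_13"))
    # -> I don't have a fix for it but yes I have seen that too.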

  258. Check out my page if you’ve got some time. WMW still cloaks, but I guess they do it based on country (geolocation). Maybe you, from the US, can’t see this, but I bump into this screen periodically.

    It really looks bad for Google as it appears you’re supporting their cloaking.

    Hope you notice this message!
    Thanks.

  259. It seems to me that Google supports cloaking; I’ve been reporting one website for about 3-5 months, and it is still showing #1 in search results for many popular queries on Google.

    Check out this blog where the problem is described entirely:
    cloaking-google.blogspot.com

    Long live google and cloaking on it !

    Bastards

  260. @Shawn about experts-exchange.com

    If you scroll waaaay down… past all the “answer only available for registered users” type posts, then past the huge menu, then down even more… you will find all the answers on the page.

    I think they just re-arrange the page elements with JavaScript or something.

  261. If Google really wanted to apply the policy “users should always see the same page that Googlebot saw”,
    they would not allow websites like
    expedia.com
    hotels.com
    opodo.com
    (just to mention some travel sites)
    to show a completely different homepage (automatic redirection depending on IP geolocation) depending on whether you are a bot or not. Bots don’t get any redirection.
    In addition, if this were an issue, why are they not even checking the IP or using some other technique to hide this fact?
    They all use simple User-Agent detection, and even a 301 redirect… Is IP geolocation now permanent?!

    Also this question: can a site with a PR less than the 6-7 these sites have apply the exact same policy without risking being banned?
    It would be very interesting to have an answer from Matt Cutts on this…..
    But my guess is that he is going to skip them completely…

  262. @Jeff and Shawn, about experts-exchange.com
    > If you scroll waaaay down…
    Yes, but this content is only viewable if you send a Google referrer. If you view their page directly you won’t see any content.

    In my mind this is cloaking, as not everyone sends a referrer with their browser.

  263. Hi Matt,

    1. Scribd definitely serves text to Googlebot (instead of the regular Flash content). Would that be seen as cloaking? And if yes, why is it still in Google’s index?

    2. Also, I always see thousands of links on any Scribd page linking back to the same page with different link text that they retrieve from analytics claiming that users landed on that page using those keywords. Now to any SEO professional, this is clearly a spammy technique, serving tons of anchor text to the bots while delivering 0 usability – rather adding to the frustration of users as those links do not lead anywhere but the page that you are already on.

    I wonder why such spammy techniques are not discounted in Google SERPs. We see it constantly popping up in search results while users are getting frustrated. I recall a recent tweet by a user frustrated with Google serving irrelevant content from Scribd. I can hunt for it again if you want it, but I would rather put your Google searching abilities to the test 🙂

    Pete

  264. Hi Matt,

    I had a question regarding cloaking of URL query parameters.
    Our application uses query parameters for tracking user flow, and we don’t want search engines to index the query parameters, as this will pollute our user-tracking metrics.
    We remove all the tracking parameters when crawlers access the page and add the tracking parameters only when a user is viewing the pages. Does this also qualify as a cloaking technique, and would it qualify for a penalty?
    For example, on page A there is a link to page B:
    http://example.com/pageB.php?src=pageA — For users
    http://example.com/pageB.php — For search engines

    Amit
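
    Not speaking for Google, but as a sketch of the mechanism Amit describes (the same page, with the tracking parameter appended only for ordinary visitors), assuming some verified-crawler check such as the reverse-DNS lookup mentioned earlier in this thread; whether this crosses the cloaking line is exactly the question being asked. An alternative that sidesteps the question is to serve the same parameterised URL to everyone and point rel="canonical" at the parameter-free URL.

    # Sketch of the link-building described above: crawlers get the clean
    # URL, human visitors get the same URL plus a tracking parameter.
    # The URL and the is_verified_crawler flag are hypothetical.
    def link_to_page_b(is_verified_crawler: bool, src: str = "pageA") -> str:
        base = "http://example.com/pageB.php"
        if is_verified_crawler:
            return base               # what the search engine indexes
        return f"{base}?src={src}"    # what a human visitor clicks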

  265. Hi

    I have a question… I recently bought a 3D flash magazine which claims to be search engine friendly… An example of the search engine friendly page which it creates is as follows:

    http://www.marinews.com/Articles/TBF/201005_TBF/Binder/TBF_201007_3D/4.html

    However, if you look at the source you will find that the content within the source code is the same as what is being displayed within the Flash.

    However, the content in Flash is actually embedded, and the content being shown to the search engine is within the page. The reason it is not shown to the user twice is that the content is hidden behind the Flash frame by setting the Flash frame to 100% height.

    See the example where I reduced the height of the Flash frame and the content started to become visible.

    Can Matt Cutts or someone else shed some light on whether this would be considered cloaking or any other black-hat technique, and whether there is a chance of getting banned in Google?

    Any help will be appreciated.

    Thanks
    Mohit Sareen

  266. Sorry, I forgot to post the URL with the reduced height of the Flash frame:

    http://developmentserver.info/demo2009/marinews/test/index.html

    Regards
    Mohit Sareen
