A quick word about cloaking
(Philipp Lenssen and I started talking about cloaking in a corner of the web, and I figured it would make sense to talk about cloaking in a separate post. Consider this a me-typing-this-quickly post, but better to get something down than to not get a chance to talk about it.)
Cloaking is serving different content to users than to search engines. It’s interesting that you don’t see all that much cloaking to deliver spam these days. If you see people doing spam, they tend to rely on sneaky redirects (often via JavaScript) more than cloaking. For example, a blackhat might make a doorway or keyword stuffed/gibberish page plus something like a JavaScript redirect to go to a completely different page.
Here’s the recent timeline of Philipp Lenssen talking about cloaking and WebmasterWorld (WMW) as I see it:
- Philipp wrote this post in late November: http://blog.outer-court.com/archive/2006-11-28-n23.html
The basic point was that if you searched for [php-based cms] and clicked on the #1 result (which was WMW), you would get a registration page rather than the page that Googlebot saw.
- I didn’t have the cycles to deal with it right then, but earlier this year I made it clear that WMW would be removed if it met the definition of cloaking when I tested it.
- I believe the administrator (Brett) of WMW made code changes to the site so that WMW would not be considered cloaking.
- I recently tested with the example Philipp originally mentioned. I did a search for [php-based cms], clicked on the #1 result, and got the same page that Googlebot saw with no registration page.
Those code changes address many of the concerns I’ve heard regarding WMW (that users who click on the results don’t get what Googlebot crawled).
So I consider Philipp’s November article acted on. Philipp’s December post about it (http://blog.outer-court.com/archive/2006-12-13-n85.html) cited the same search for [php-based cms], so I consider that acted on as well.
I believe that takes the timeline up to February. I’m aware of two other posts Philipp did on this topic, both in February. The first is http://blog.outer-court.com/archive/2007-02-05.html which says that if you go to http://www.webmasterworld.com/forum44/1287.htm you get redirected to a registration page. When I tried it just now, I got the actual page with a “Welcome to WebmasterWorld Guest from (ip address)”. The question I’d look at for this report is: if you typed this URL into Google and then clicked on the result, would you receive the same content that Google saw?
The other article that I’m aware of is http://blog.outer-court.com/archive/2007-02-20-n47.html where Philipp states that sometimes Google allows sites to go against our webmaster practices. But that statement includes an asterisk; the disclaimer at the bottom of that post is “WebmasterWorld doesn’t always show the registration page; they sometimes show the content that was available in the snippet.” As I understand it, that disclaimer acknowledges that some of the time, WMW gives users what Googlebot crawled. When I get a chance to tackle Philipp’s most recent report, I’ll be looking at consistency: when a Google user clicks on a search result at Google, they should always see the same page that Googlebot saw. It will take me a little time to check out, because it’s a report of behavior that often meets our guidelines (e.g. cookies, referrers, IP addresses might all come into play), but I do intend to investigate this issue when I get the cycles. I won’t consider this issue closed until I have the time to investigate how consistently the return-the-same-content-as-Googlebot-saw behavior happens; it should happen for every click from a Google search result.
To sum up, we did take action on Philipp’s questions about WMW. I consider the issue in a much better state now, in that most (all?) Google searchers get the identical page to what Googlebot saw. But I still consider Philipp’s February posts open for investigation, and I will get to them, in the same way that I tackled Philipp’s first two posts about this.
Ram Manohar Tiwari Said,
March 3, 2007 @ 12:03 pm
Just wondering if I got the sum right that was used for SPAM protection
Just to let you know that I am not a SEO but I run my personal website related to my profession.
Actually, I have a problem with some of my webpages.
Can you please let me know if Googlebot can actually run JavaScript? I saw a few forum posts suggesting that URLs can be extracted from the JS code. But the problem with my webpage suggests that Googlebot can actually execute the JavaScript code.
I used JavaScript code to solve the orphan page problem of my framed website [as suggested in http://www.netmechanic.com/news/vol5/javascript_no7.htm ].
Within the last week, I found that my high-ranking website on SAP (ERP) has lost its rank for a great many keywords. But that’s not the problem I want to discuss here.
One of my website’s pages, http://www.geocities.com/rmtiwari/Resources/Management/ASAP_Links.html , ranks number one (and still does) for the term ‘ASAP Methodology’.
However, as of today the title of the webpage has changed to the main.html title (not the ASAP_Links.html one), which is the page used in the JavaScript to solve the orphan page issue. That seems to suggest Googlebot is actually executing the JavaScript, and all my pages will eventually end up having the same title.
Please suggest.
Thanks,
Ram
Harith Said,
March 3, 2007 @ 12:38 pm
Matt
Thanks for detailed feedback on the subject.
However, when I read comments on Philipp Lenssen’s blog like the ones below, I just wonder whether he is singling out WMW for discussions related to cloaking or for “other reasons”:
“It’s *really* unfortunate that anyone (especially search engine representatives) continues to post to WebmasterWorld instead of an open forum.
Webmasterworld.com is easily the most deficient forum site I’ve encountered.”
“That depends…do you sponsor conferences and have google employees speak at them? If so you can do anything you want…if not, then you’ll have to play by the rules.”
“Androw, make sure you follow the precise technical implementation of WebmasterWorld. This should be safe, because WMV isn’t banned.”
Jenstar Said,
March 3, 2007 @ 12:40 pm
Since I do a lot of AdSense related searches, I know that any entry in the AdSense forum is not viewable unless you are logged in, regardless of IP. I was looking up something I wrote there on smart pricing while I was in London at SES last month, and I got rerouted to the login page. And I just checked it now from home, and same thing. The AdSense forum has done that for a couple years at least.
http://www.google.com/search?hl=en&q=adsense+smart+pricing+update&btnG=Search&meta=
YPN forum is the same, I don’t frequent the others as much to know.
However, I suspect Brett will change that query to give the true page as soon as he sees this.
jeremy schoemaker Said,
March 3, 2007 @ 12:43 pm
Well… there is a ton of evidence of this going on. Is this investigation going to close when webmasterworld has enough time to fix it?
john Said,
March 3, 2007 @ 12:43 pm
Matt-
Great post - what about people who use counter scripts to gain first page rankings for impossible keywords? Mesolink.com comes to mind, as they pretty much own the mesothelioma space and have done so by placing thousands of counters on unrelated web sites and .edu’s, thus gaining the #1 position on Google - even over the national cancer pages for words like mesothelioma etc… All they are is a lawyer looking for and reselling leads - the site is hardly even maintained.
Web sites that have used methods like this should be looked into.
graywolf Said,
March 3, 2007 @ 12:53 pm
Ok so how is this query not cloaking
http://www.google.com/search?hl=en&client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&hs=cra&q=new+york+times+january+2006&btnG=Search
screen shot in case you see something different
http://farm1.static.flickr.com/152/409090766_b136513bef_o.png
I don’t see any of the text from the snippet “Go get yourself fitted for a new chain-mail vest. A bevy of experiments in” on the page I see. Since they have caching turned off (a good idea when cloaking) I can’t see exactly what you saw but your snippet tells me what you did see at one point in time.
Here’s the landing page I get:
http://select.nytimes.com/gst/abstract.html?res=F10616F73D540C718CDDA80894DF404482
Philipp Lenssen Said,
March 3, 2007 @ 12:56 pm
Harith, those comments cited above, except the last one, aren’t by me. The last one, by me, is very simple reasoning: if site A does something and isn’t banned, then you can do the same if you follow site A’s process, because Google uses the same guidelines and algorithms to ban or not ban sites (hopefully). For the record, I think WMW is a cool resource, I’ve been registered for a long time, and I don’t have any personal “history” with the site or its owner, as you seem to imply.
Matt, as for your explanations, I think we still have some way to go to solve this. I don’t know if WMW changed something since my first post, it’s quite possible, but what I now often get is this behavior: the first click on the site shows the content, and if I click on it a second time from Google, I see the registration page. However, I also cannot consistently reproduce this. I consider the behavior when I am forwarded to a registration page a “sneaky redirect” because I’m not getting to the content that the snippet presented, and “sneaky redirects” aren’t permitted per Google’s webmaster guidelines. If WMW changed their code in response to the discussion we had, mostly showing the real content now, then I think that’s a great step in the right direction, although they haven’t gotten rid of all “sneaky redirects” yet.
Brett Tabke Said,
March 3, 2007 @ 12:59 pm
First, thanks to Matt for the heads up. As most know, WMW has been the target of extreme amounts of bot activity over the years and has taken proactive steps to fight it. Our first and foremost job is to the regular members of the site. I have done everything I can think of to stop the bots from the cable and DSL ISPs. I have even gone so far as to ban entire TLDs some of the time (china, russia) where heavy botnet activity exists.
I think we have finally found a system that everyone can live with and keeps the content open for everyone, but slows down the bots. I say slows down, because there is no way to stop them.
Thanks Matt
bt
Bob Gladstein Said,
March 3, 2007 @ 1:07 pm
Matt, are you familiar with this thread at the SEW forum regarding nytimes.com? At the beginning of the discussion, Danny Sullivan stated that he believed they were cloaking. By the end, he’d changed his mind about it.
In my opinion, it’s cloaking.
Brett Tabke Said,
March 3, 2007 @ 1:11 pm
Philipp, yes it has changed since your first post. I am not going to detail it, since clearly, some of the botnet operators are prime readers right here. I actually built the new system to Matt’s specs. I don’t blame anyone for being upset that their ISP was on a required login - I also get miffed when we click through to the NY Times and are forced to log in.
On the other hand, I am bothered that some of the criticism of those policies always comes from advertising program supporters and operators. They have no understanding of the risks and problems associated with trying to run an essentially free subscription site. Go to copyscape and put in webmasterworld. Thousands upon hundreds of thousands of messages ripped off. Entire sites exist just to rip off members (like even Google Guy and MSNDude). How can you possibly compare a site that lives in that environment to a normal site built on advertising, where traffic is everything? It is a completely different proposition with different needs and requirements that aren’t based in traffic for traffic’s sake.
Philipp, if you want a tour around the back end to see how it works, I would more than be happy to show it to you if you wouldn’t publish specifics, but aggregated data on it.
bt
Matt Cutts Said,
March 3, 2007 @ 1:31 pm
Harith, I honestly think that Philipp just wanted to make sure that Google’s quality guidelines are interpreted fairly and consistently, and he’s right to want that.
Philipp, I made the edit for you and deleted the extra comment. As I noted in the post, I believe WMW did make code changes since your first post to deliver to users what Googlebot crawled. And I do consider your February posts still open and will investigate them in more depth.
JohnMu Said,
March 3, 2007 @ 1:35 pm
What bugs me about WMW is how some people see the login pages sometimes and others don’t. It feels like they’re just playing games with the users. What if it only displayed the login page 10% of the time, it would be almost impossible to prove. What’s wrong with just doing it right? Considering they have a whole forum on cloaking, I doubt they’re going to drop it completely…
What is the official take on sites like the NY Times which show the full article to Googlebot but only the first section (snippet) to the non-registered user? It’s easy to tell that Googlebot is getting different content (from Google’s snippet). Is that close enough?
Example: [Attacking online China]
Shows a different snippet in the search results than in the abstract-page that solicits buying the full article or a subscription.
Note: I don’t want to single out any site, it would just be great to have a clear guideline :-).
How much cloaking is ok? Is removing the “filler” navigation/sidebar ok? How about removing ads? Javascript? Session-IDs?
Ram Manohar Tiwari Said,
March 3, 2007 @ 1:43 pm
I think I jumped into some kind of personal and rather historical discussion. Sorry about that.
Actually, I tried to post my question in one of the forums and incidentally, I found WMW on the Google search.
WMW seems to be a big forum [ almost sounds like a BMW ;-)] but they charge for the membership, it seems. Anyway, I will try my luck on other forums.
feedthebot Said,
March 3, 2007 @ 1:47 pm
I agree with Philipp on the whole thing with WMW being pretty sneaky, and I have always had confidence that it would be looked into.
I am writing the guideline descriptions, but I am a bit stuck on defining and explaining to the general new webmaster what a “sneaky re-direct” is. I know there are many examples, but any suggestions by you Matt, or anyone reading this, if you know of a good article or someplace Google has defined “sneaky” I would appreciate it if you could pass that info to me.
I get a lot of questions about it.
I see “cloaking” as well defined, but in all honesty the best example I have yet seen of something being “sneaky” is the WMW redirect to that login screen. I will not use that in my Feedthebot description, but it does seem to be sneaky.
Let me know about anything that would help me define “sneaky re-directs”!
pat, guidelineguy@gmail.com
Nick - I think the original Nick here. Said,
March 3, 2007 @ 1:48 pm
In my opinion, no amount of cloaking should be okay. As a user, I don’t want to see something juicy on the search results just to have to hit the back button as soon as I hit the page.
It would probably be expensive but, it seems like Google should have two bots. One Google Bot and one that identifies itself as IE. Then the two results should be compared and the sites that don’t match could be easily dropped. Like I said, expensive and would take up bandwidth but it would clear the whole problem up in one shot.
4Eyes Said,
March 3, 2007 @ 1:49 pm
The reason that WMW cloaks is pretty clear - money.
It is simply playing on the fact that many newbies get confused between the free membership and the paid membership and cough up the dough.
It is as cynical and manipulative as any other form of cloaking and as such needs treating in the same way.
FWIW, I am neither pro nor anti cloaking. I am anti hypocrisy.
In a perfect world, Google would have an ‘intent’ detector, and in that perfect world WMW would be banned.
Michael Martinez Said,
March 3, 2007 @ 1:57 pm
I still occasionally get the registration screen when clicking on links to the WMW discussions. The system has a way to go but there has been improvement.
It would still be nice if Google would do something about the academic paper archives that cloak like spammers on speed.
Brian Donovan Said,
March 3, 2007 @ 2:00 pm
Springer does the same thing with their journal site(s).
I was searching for a paper by Feigenbaum, “Quantitative Universality for a Class of Nonlinear Transformations”, and popped the title into Google.
The top result is always:
> [PDF]
Quantitative universality for a class of nonlinear transformations
> File Format: PDF/Adobe Acrobat
> Quantitative Universality for a Class of. Nonlinear Transformations. Mitchell J.
> Feigenbaum. Received October 31, 1977. A large class of recursion relations …
> http://www.springerlink.com/index/Q555G62245040133.pdf - Similar pages
Clicking through, however, you’re redirected to http://www.springerlink.com/content/q555g62245040133/
… which is just an abstract. The paper itself is behind a paywall.
I run into this every time I Google up a paper that’s held by Springer.
Johnny Sutherland Said,
March 3, 2007 @ 2:10 pm
Please don’t ban this site, but I have the same problem with this site - http://mail-archives.apache.org/ - content listed in Google vs. content found after clicking on the link. It is a forum, and it is very frustrating when you think you have found the post you want in Google but it cannot be found after clicking on the link without navigating down the forum.
Brett Tabke Said,
March 3, 2007 @ 2:18 pm
We did detail a few bits about the new system in a post,
http://www.webmasterworld.com/webmasterworld/3270248.htm
The time element is probably why you had to log in, Michael. We have dialed that down to as short as I feel is OK - but you still have bots that slow crawl at a page view every 2-3 minutes - so it still isn’t perfect.
Philipp Lenssen Said,
March 3, 2007 @ 2:23 pm
> Example: [Attacking online China]
> Shows a different snippet in the search results than
> in the abstract-page that solicits buying the full article
> or a subscription.
Google shows this URL (shortened):
http://www.nytimes.com/2007/02/05/technology/05marx...
However, you’ll be redirected to this URL:
select.nytimes.com/gst/abstract.html?res=F00…
And interestingly enough, the NYT does not allow Google to show a cached copy of the page. Utills once said ( http://blog.outer-court.com/forum/40604.html#id40612 ) that the NYT “fades out” pages, meaning they’re live for everyone for a while — human visitor, Googlebot (no cloaking) –, and then after a while they’re starting to redirect to a registration page. I don’t know if that’s the case, or in case it is whether or not that fits the “sneaky redirect” definition (while I don’t like this behavior I think a website has the right to change page behavior over time, and at least the old page content will be ultimately “faded out” of Google’s index too). But I’m curious to hear Google’s, e.g. Matt’s statement on this, because webmasters need transparency over these issues.
Brett Tabke Said,
March 3, 2007 @ 2:54 pm
Aren’t there existing agreements between Google and the NYT that allow the NYT to do that? Which I always thought surprising, given the NYT’s commitment to transparency.
LifeInSearch Said,
March 3, 2007 @ 3:34 pm
Hi Matt. What about multivariate testing? Would G consider this a form of cloaking? In the strictest definition, I suspect that they would. How would you suggest we go about performing such testing without offending the almighty G? Thanks in advance for your time.
Bentley007
JLH Said,
March 3, 2007 @ 3:36 pm
Congratulations Matt on this open discourse! Scrapers stealing your content sucks, but the playing field has to be the same for everybody. Not everybody is going to be as ethical as Brett and use cloaking to generate their own mailing lists or worse yet to sell subscriptions based on the free snippet provided by google. I for one wouldn’t mind seeing a little (subscription required) next to a result if that was true. But then the question becomes, where should those rank? With the regular free results or on their own? or only if free resources are low? Tough questions, but that’s why you get the big bucks.
Success has a price and perhaps that price is to decide if you want to be in google’s results or ban all bots and stop scrapers, of course the price of that is that someone will find a way to steal the content and they’ll rank where WMW should have.
My only request is that ALL sites be treated the same. Google’s always prided themselves on no one being able to ‘buy’ their way to the top or into the index, and money isn’t the only currency. So with a move like this post, you are enforcing that directive even more.
Oh, and hats off to Philipp for hanging tough on this and getting results, as well as to Matt for giving the results that we’ve all been waiting for.
mrg Said,
March 3, 2007 @ 3:52 pm
Matt, I understand cloaking is bad and that sneaky redirects are strictly no-questions-asked ban behaviour. Good for wmw that they were given an opportunity to redesign the forum to Google’s specs before any action was taken. Perhaps such an important resource deserves it…
What would you recommend to more mortal webmasters with sites a lot less important than wmw, whose servers got hacked and websites banned because listings in Google were -unknown to the owner- being cloaked and sneaky redirecting to some other site?
Reinclusions are not being responded to.
Doug Karr Said,
March 3, 2007 @ 4:13 pm
Matt,
Isn’t this also utilized for valid services as well? I’ve noticed this method utilized by sites that are heavily flash… they push their content through some type of server-side scripting to both a page and Flash file, then they utilize Javascript to replace the content client-side.
Regards,
Doug
Neal Said,
March 3, 2007 @ 5:54 pm
What peeves me are all the spam blogs on blogger that do the same types of things. I follow a Google blog search link to a blogger blog and I either get redirected completely or I get fed something totally different (via javascript) and there’s no report button up top to click on.
Aaron Pratt Said,
March 3, 2007 @ 6:52 pm
I notice a bunch of cloaking software sites with graybar pagerank out there, looks like I will have to prune a few folks and links from my blog to break negative link & keyword associations.
Nothing personal guys, seen that “Sorry but I am going to have to block you” commercial?
;-o
Wheel Said,
March 3, 2007 @ 7:12 pm
I fail to see why all sites have to be ‘treated equally’. Google’s intent is to stop spam, not have a web love-in. The only reason to treat sites equally is if they can’t be bothered to hand review stuff or look at exceptions. And I certainly don’t want that.
Some sites will have valid reasons for cloaking. Perhaps WMW has a valid point for example - I know if I was paying a ton of cash to feed the bots each month, and if cloaking would fix that, then I’d be cloaking too. And at that point, I’d probably appreciate Google being able to listen to reason on the odd case instead of the rubber stamp of ‘all sites must be treated equal’.
That being said, if the NYT’s intent isn’t cloaking for cash, then perhaps some type of captcha would prove their sincerity more than a registration screen.
Marcia Said,
March 3, 2007 @ 7:18 pm
The true page that visitors see IS being presented. The URL I click on in the Google SERPs is *exactly* the precise one I get to see, every single time.
I spent a good 4 years with as much on-site time at WebmasterWorld as just about anyone else around, and I clearly recall times when the site would slow to a crawl because of rogue bot activity.
Any useful, popular site with a serious enough problem to negatively impact their entire user base would have to be NUTS not to exclude unauthorized access.
http://www.google.com/support/webmasters/bin/answer.py?answer=35769
“Don’t use unauthorized computer programs to submit pages, check rankings, etc. Such programs consume computing resources and violate our Terms of Service. Google does not recommend the use of products such as WebPosition Gold™ that send automatic or programmatic queries to Google.”
It’s like the difference between practicing curative and preventive medicine. Unfortunately, there are “diseases” that affect websites that will ravage them unless preventive measures are taken beforehand. Like captchas and requiring registering to control access and damage by spammers, scrapers and thieves.
tips Said,
March 3, 2007 @ 7:54 pm
Glad to see that this issue is finally getting resolved (I also wrote a long article about it back in November). What if many sites started doing this kind of cloaking/login screen? It would be terrible for the Web in general. Thanks for fixing it.
Shawn K. Hall Said,
March 3, 2007 @ 9:08 pm
It appears to be the second click while at WMW that now imposes the login form; third if you have referers disabled. This is true even if you have the same browser window open and just click > back > click > back > click.
Shawn K. Hall Said,
March 3, 2007 @ 9:18 pm
Here’s another sample:
http://www.google.com/search?q=site%3Awebmasterworld.com+php
Click on a link, then back, then a different link, and back… the third or fourth link I try, linked from Google, requires login.
Mike Schinkel Said,
March 3, 2007 @ 9:49 pm
Matt: The subject of your post caught my eye. I have always wanted to discuss cloaking but not in the same context as you are discussing with WMW.
I want to describe an example of cloaking that I think *should* be okay, and I want to get your take on if it is or is not, and if not why not.
Let’s say you have a “professional” blog with content on the page, but for each Permalink page there is a lot of cross-promoting of other posts and categories, etc. There are also all the advertisements on the page. And affiliate promotions. And so on.
So looking at the markup, the page content might comprise less than 1/2 of the total markup. That markup is of course there to market additional content of potential interest to the reader, but in reality it is not really part of the page’s content.
On the other hand, when Googlebot visits, the blog serves a very clean version of the page with only the content; no ads, no cross-marketing links, no affiliate links, no footers, etc. This way Google can actually see what the page is about and not have all the cruft to wade through. It would be a lot like the information a “Print This” link would display.
So is this *bad* cloaking, and if so, why? Also if so, is there not a way that a site owner can provide hints to Google’s indexer, like maybe recognizing that what is contained within a DIV tag with an attribute of @id=”content” is the page’s content and the rest should be ignored for indexing much like how AdSense allows similar?
P.S. My blogs currently DO NOT do what I asked about so don’t go trying to ban them or anything!
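To make the "hint" idea above concrete, here is a rough Python sketch of what an indexer could do if it chose to honor such a hint: keep only the text inside the DIV whose id is "content" and ignore everything else. This is purely illustrative. Nothing in this thread says Google works this way, and the class name, function name, and sample markup are all made up for the example.

```python
from html.parser import HTMLParser


class ContentDivExtractor(HTMLParser):
    """Collects text found inside <div id="content"> (hypothetical hint)."""

    def __init__(self, target_id="content"):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # 0 means we are outside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            # A nested div inside the content area: track it so the
            # matching </div> does not end the content area early.
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_content(html, target_id="content"):
    """Return the text inside the target div, joined by single spaces."""
    parser = ContentDivExtractor(target_id)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up page: navigation, the real post, and a footer.
SAMPLE_PAGE = (
    '<div id="nav">Ads and cross-promotion</div>'
    '<div id="content"><p>Real <b>post</b> text</p><div>nested part</div></div>'
    '<div id="footer">Footer links</div>'
)
```

With a hint like this, the site would serve the same markup to everyone and the indexer would do the trimming, so there would be nothing to cloak.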
Vic Said,
March 3, 2007 @ 10:13 pm
Matt - Isn’t this along the lines of the First Click Free program? I pushed for it back at my last company because they were whitelisting Googlebot so that all articles would be indexed but if a user clicked on an article from a SERP they would be redirected to a login page. But with First Click Free, the document stated that we should continue to allow Google to view the content based on the User-Agent and that if a user clicks on a link from Google, that page needs to be provided to the user without registering. If it’s a multi-page article, the user needs to be given all pages of the article. If they navigate anywhere else, a login prompt can be thrown at them.
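The rules described above can be sketched as a small decision function. This is a minimal Python sketch under my own assumptions, not the program's actual specification and not WMW's code: the function names, the "content"/"login" return values, and the way the entry article is tracked are all hypothetical, and the referrer check only recognizes google.com hosts to keep it short.

```python
from urllib.parse import urlparse


def is_google_referrer(referrer):
    """True if the referrer's host is google.com or a subdomain of it.

    Simplification: country domains such as google.co.uk are not handled.
    """
    host = urlparse(referrer or "").hostname or ""
    return host == "google.com" or host.endswith(".google.com")


def first_click_free(user_agent, referrer, logged_in,
                     entry_article=None, requested_article=None):
    """Decide whether to serve "content" or a "login" prompt.

    entry_article is the article the visitor first landed on from a
    Google result (hypothetical session state); requested_article is
    the article that the current page belongs to.
    """
    if logged_in:
        return "content"
    if "Googlebot" in user_agent:
        # User-Agent check only, as the description above says.
        return "content"
    if is_google_referrer(referrer):
        # First click from a Google result is free.
        return "content"
    if entry_article is not None and entry_article == requested_article:
        # Later pages of the same multi-page article stay free.
        return "content"
    # Navigating anywhere else gets the login prompt.
    return "login"
```

Note that the Googlebot branch trusts the User-Agent header alone, so anyone can send that header and read everything.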
Craig Said,
March 3, 2007 @ 10:26 pm
Speaking of cloaking, you should check this out.
http://www.towersystems.com.au/fhn_blog/archives/2007/03/rsvp_scam_follo.html
Brett Tabke Said,
March 3, 2007 @ 10:30 pm
Vic, that is exactly what we have implemented (see: http://www.webmasterworld.com/webmasterworld/3270248.htm ). After that, only a selected set of ISPs are required to log in, and then only those ISPs that have had a history of bot abuse. I have been debating detailing it entirely, but how do I do that and not expose the system to 10-fold the abuse by letting bot runners know exactly the weaknesses in the system? When some of the abuse is (imho) by competitors, how can you detail what is effectively a security matter without making the system moot? That is the reason we don’t pull a Google and throw up a “you look like a bot” page and then require them to log in: they’d know what triggered the block action.
Vic Said,
March 3, 2007 @ 10:44 pm
We faced the same dilemma because in theory anyone can game the system by just searching in Google, get the article they want - go back to Google and so forth. We had our own very powerful search tool but if a user wanted to use it, they’d have to login. Also, if the page is cached, then a user can just view cache to see the article. The other part of the first click free is that by allowing Googlebot to see the content based only on a User-Agent lookup, anyone can use any number of tools to pretend they are Googlebot and view all of your content.
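One common way to narrow the User-Agent spoofing hole mentioned above is a reverse DNS lookup on the visitor's IP followed by a forward lookup on the resulting host name. Below is a rough Python sketch of that check. The function name, the injectable lookup parameters (there so it can be tried without network access), and the exact host suffixes are assumptions for illustration; check Google's own guidance before relying on any of it.

```python
import socket

# Host name suffixes assumed here to belong to Google's crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def verify_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if ip reverse-resolves to a Google host name that
    forward-resolves back to the same ip.

    reverse_lookup(ip) -> host name, forward_lookup(host) -> list of IPs.
    Both default to real DNS lookups through the socket module.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip)
    except OSError:
        return False  # no reverse record: treat as not verified

    if not host.endswith(GOOGLE_SUFFIXES):
        return False

    try:
        # The forward lookup stops someone from faking the reverse record.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

A request that claims to be Googlebot in its User-Agent but fails this check would then be treated like any other anonymous visitor.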
TOMHTML Said,
March 3, 2007 @ 11:24 pm
Is a page (IP) cloaked? Use Google Analytics! You will see what Googlebot sees!
dockarl Said,
March 3, 2007 @ 11:44 pm
If I were google I’d take this approach.
1. Flag any site that doesn’t allow caching as suspicious, and internally cache the result.
2. Send a ‘userbot’ (my term) that doesn’t identify itself as a bot and comes from an unknown IP pool and internally cache the results of the latter crawl. (geez, I get these all the time from China and southern Asia, so it must be possible)
3. Immediately send back 1, and check whether content still differs.
4. Ban site.
Doc
German Said,
March 4, 2007 @ 12:10 am
If you want to see cloaking, just look at the pay-per-view online newspapers (or online versions of paper newspapers):
Google gets a snippet, and when you click on the link (or even on the cache) you get redirected to a page where you have to subscribe to go further (i.e. pay to read the article). Sometimes there is not even a cache, only a snippet of the article (in Google’s description) leading you to a registration page.
To me, it is the most annoying kind of cloaking and these results don’t deserve to be in Google. It is wasting my time when I look for information.
Tony Ruscoe Said,
March 4, 2007 @ 2:32 am
“Nick - I think the original Nick here.” and “dockarl”:
I don’t think it would be fair to simply spider a page using two different bots and see if the content returned to each differs because some websites have page elements that change dynamically on each visit. (Take a simple server-side hit counter, for example, or a “Random Tip” feature.)
If Google actually took an approach like this, they’d probably have to consider what % of the page content being served to each bot was different and maybe even request each page twice (or more) using the same bot to see if the page content usually differs… and even that wouldn’t be accurate all the time.
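As a rough illustration of that idea, here is a small Python sketch that fetches nothing itself but compares three page bodies: two fetches by the same bot (to measure how much the page changes on its own) and one fetch by a browser-like client. The function names, the word-level comparison, and the 0.2 margin are arbitrary choices for the example, not anything Google is known to use.

```python
import difflib


def similarity(page_a, page_b):
    """Word-level similarity between two page bodies, from 0.0 to 1.0."""
    return difflib.SequenceMatcher(None, page_a.split(), page_b.split()).ratio()


def looks_cloaked(bot_fetch_1, bot_fetch_2, browser_fetch, margin=0.2):
    """Flag a page only if the browser copy differs from the bot copy
    by clearly more than the bot copies differ from each other."""
    baseline = similarity(bot_fetch_1, bot_fetch_2)  # normal churn (counters, tips)
    cross = similarity(bot_fetch_1, browser_fetch)   # bot copy vs. user copy
    return cross < baseline - margin


# Made-up page bodies for illustration.
BOT_1 = "Thread about PHP CMS choices. Visitors today: 101"
BOT_2 = "Thread about PHP CMS choices. Visitors today: 102"
USER_SAME = "Thread about PHP CMS choices. Visitors today: 103"
USER_LOGIN = "Please register or log in to continue"
```

A hit counter alone would not trip this check, while a registration page in place of the thread would. As noted above, even this wouldn't be accurate all the time.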
dockarl Said,
March 4, 2007 @ 3:09 am
Hi Tony - yup - granted not all sites remain exactly the same between crawls because of these minor elements.
What I’m talking about are sites like WMW and the NY Times that entirely cloak (or have cloaked) content to users but not to bots, or vice versa.
A great example would be if I had a site that presented content to all bots suggesting it was a resource page for a topic such as ‘search engine cloaking’ when, in actuality, it was presenting ‘Hot Russian Brides’ to users.
We’re talking big diffs here, not fringe changes.
Doc
William Donelson Said,
March 4, 2007 @ 3:24 am
Matt, we use cloaking (or so I’m told by our hosting service) because our main site
http://www.armchair-travel.com
has the word “travel” in it, and some of our Virtual Tour customers have firewalls that block any URL with the word “travel” in them…
So we have an alias of our website
http://www.armchairvr.com
which is the same data but served under the “vr” rather than “-travel”.
We are not concerned with links into the “armchairvr” site, or listings for it in search engines. It is just a convenience for some of our clients.
I would guess that it is possible that someone would link to the “armchairvr” site without our knowledge (not likely, but possible).
How would you view that?
Thanks
William
arnoweb Said,
March 4, 2007 @ 5:07 am
Hello,
Just a comment concerning users who need to check if a page is cloaked.
Usually, you have to check the page in Google’s cache.
But what can you do if this page uses the noarchive directive?
A method that maybe you already know consists of doing IP spoofing of GoogleBot.
To reach this IP, you have to use Google Analytics then select the function eyetracking of Analytics (number of clicks for each area on the page).
Previously you must create a new profile for the url of the new website.
Here is an example of a cloaked page:
http://arnoweb.free.fr/concours_crack_cloaking.jpg
Very interesting for easily identifying webmasters who practice cloaking, without needing technical knowledge.
Doug Heil Said,
March 4, 2007 @ 5:27 am
lol
This thread is getting to the heart of the “never” ending debate of “What is cloaking”?
Okay; Obviously, I do “not” agree that WMW is cloaking as per the definition of cloaking as I see it. Cloaking IS search engine spam. WMW is not cloaking. Period. If you are detecting Googlebot “solely” and intentionally in order to serve googlebot one page, and to serve a regular search engine user another page, then yes, that is cloaking and that is search engine spam. All WMW is doing is detecting ALL spiders it does not want to allow access to and denying them. Along with that, they are detecting a “signed-in” member as well and giving them the url. Along with that, they are detecting a real user who is “not” subscribed (cookies) and giving them the registration page.
NONE of this is “cloaking”. If you truly use the real definition of cloaking as being the detection of a search engine spider (googlebot) and giving that spider a different page than a user sees, then you are cloaking and you are implementing search engine spam.
In my mind; cloaking is “always” spam. It cannot be anything else. If this industry does not want to be totally transparent, then expand the definition of cloaking to other areas as well. What you will achieve is claiming that any site who is detecting the user agent “By Country” and sending them to a country specific page would also be cloaking. I could go on and on about this. If my site is detecting whether or not the user agent has Flash installed, and sending that agent to the appropriate page… am I now cloaking as well? NOT. Period.
WMW is Not cloaking.
Doug Heil Said,
March 4, 2007 @ 5:35 am
Further clarification;
I do not like the fact that I click on a WMW link in Google serps and get a sign-in page. No doubt about it,.. I hate it. But the difference is; I understand the “why” behind it. It totally has nothing to do with “cloaking” as cloaking is always search engine spam. I truly hate the idea of calling anything and everything cloaking. Cloaking is detecting googlebot “only” or yahoo and MSN, and sending those bots to a different page rather than what “sally” sees in the google serps. That is cloaking. Lumping together ALL forms of content delivery and calling it cloaking is not in the best interest of our industry. Not at all.
Doug Heil Said,
March 4, 2007 @ 6:00 am
Maybe a fix to this could be that Google puts a disclaimer on pages that require a sign-in? Maybe a meta tag that google would recognize could trigger this little disclaimer “beside” the serp listing? Something like meta name=subscribe content=requires sign-in. This could trigger the disclaimer on the serps. I don’t know what the answer is, and I understand the big problem. But I also don’t want everything to be called cloaking when it clearly is not.
My forums have a private area. Right now I am showing the signin (have to subscribe) to anything and everything that accesses it. Only those with the correct cookie get to see the thread. But if I wanted to “detect” ANY user agent that is not subscribed and give them the sign-in page, but allow ANY user agent to see the thread that IS signed in, am I cloaking? Nope.
A simple little disclaimer beside the google listing would go a long way toward clearing up the confusion. The search engine user could clearly see they have to subscribe before clicking the link. But it gives the user the option of clicking… or not. It also would serve to give all pages access to the google serps without confusion. Just because a page is a subscription page should not be a reason to deny that page a listing in the serps. There has to be a way that all parties can live with.
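To make the meta tag idea above concrete, here is a minimal sketch of how a results page could pick up such a tag and label the listing. The `subscribe` meta name and its `requires sign-in` value come from Doug's suggestion and are purely hypothetical; no search engine is known to recognize them, and the class and function names are made up for this example.

```python
from html.parser import HTMLParser


class SubscribeMetaParser(HTMLParser):
    """Looks for the hypothetical <meta name="subscribe" content="requires sign-in">."""

    def __init__(self):
        super().__init__()
        self.requires_sign_in = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        if name == "subscribe" and content == "requires sign-in":
            self.requires_sign_in = True


def listing_label(title: str, html: str) -> str:
    """Return the listing title, with a disclaimer appended when the tag is present."""
    parser = SubscribeMetaParser()
    parser.feed(html)
    if parser.requires_sign_in:
        return f"{title} [requires sign-in]"
    return title
```

The sketch takes the page's word for it, so it only works if webmasters declare the tag honestly.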
Aaron Pratt Said,
March 4, 2007 @ 6:35 am
That’s it! A useless “user experience”.
I don’t have a “paid” membership to WMW and have always been bothered by any traces of it in the serps. People also link to it from their blogs, you go there and you can’t even signin unless you pay, lame!
I would weigh if this “claim” of bots is true or not, something doesn’t smell right.
Chris Said,
March 4, 2007 @ 7:08 am
Doug, that’s your definition, I think most people would call what WMW was doing cloaking. As Aaron says, it makes a useless user experience.
Brett can justify it any way he wants, the fact is no other webmaster forum that I know of needs to do these things. WMW is neither the largest, nor the most popular, and content scraping is not a problem unique to them.
There are numerous ways to combat content scraping that do not involve cloaking. Firstly many scrapers do not spoof their user agent, so your first step is banning all said user agents. Secondly many do spoof their useragent but do so in an unstandard way (an IE useragent in a format no actual IE installation uses for instance). Detect and ban that. Thirdly once you do detect bad behavior, ban the IP (like with a brute force SSH login firewall). You can set up a bot trap with a link that is blocked with meta robots and robots.txt and ban any IP that accesses it. Banning entire abusive countries or IP blocks that are used by server companies and not IPs is another good step as well. Finally there are server modifications for Apache that can throttle users who request too many repeated pages.
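The bot trap mentioned above could look roughly like the following. This is only a sketch under assumptions: the `/bot-trap/` path, the `BotTrap` class, and the status codes are hypothetical, and a real site would hook this into its web server or firewall rather than a Python object.

```python
from urllib.robotparser import RobotFileParser

TRAP_PATH = "/bot-trap/"  # hypothetical URL, linked invisibly from normal pages
ROBOTS_TXT = "User-agent: *\nDisallow: /bot-trap/\n"


def compliant_crawler_may_fetch(path: str) -> bool:
    """Show what a robots.txt-respecting crawler would do with a path."""
    rules = RobotFileParser()
    rules.parse(ROBOTS_TXT.splitlines())
    return rules.can_fetch("*", path)


class BotTrap:
    def __init__(self):
        self.banned = set()

    def handle(self, ip: str, path: str) -> int:
        """Return an HTTP status code for the request."""
        if ip in self.banned:
            return 403
        if path.startswith(TRAP_PATH):
            # Only a client ignoring robots.txt ever requests this path.
            self.banned.add(ip)
            return 403
        return 200
```

Because the trap path is disallowed in robots.txt, well-behaved spiders never request it, so only clients that ignore the rules end up banned.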
No actual user is going to request 1000+ pages a day, but on a large site bots certainly will. Once a user reaches that level, give them extra attention, compare IPs with the IPs of known search engine spiders. If it doesn’t match, ban them. So they were only banned once they reached 1000 pages, that’s like less than 1/1000th of your entire site, big deal. Obviously to implement his cloaking system Brett already has an IP comparison system, he just would have to switch how it is used.
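A minimal sketch of that threshold check, assuming an in-memory counter that is reset daily. The `DailyLimiter` class is hypothetical, and the network in `KNOWN_SPIDER_NETWORKS` is a reserved documentation range used as a placeholder, not a real crawler range.

```python
import ipaddress
from collections import Counter

DAILY_LIMIT = 1000  # pages per day, the figure suggested above
# Placeholder only (RFC 5737 documentation range), not a real spider network.
KNOWN_SPIDER_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_known_spider(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in KNOWN_SPIDER_NETWORKS)


class DailyLimiter:
    def __init__(self, limit: int = DAILY_LIMIT):
        self.limit = limit
        self.hits = Counter()
        self.banned = set()

    def allow(self, ip: str) -> bool:
        """Count one page request and decide whether to serve it."""
        if ip in self.banned:
            return False
        self.hits[ip] += 1
        # Over the limit: serve known spiders, ban everyone else.
        if self.hits[ip] > self.limit and not is_known_spider(ip):
            self.banned.add(ip)
            return False
        return True

    def reset_counts(self):
        """Call once per day; existing bans are kept."""
        self.hits.clear()
```

The limit is a tunable, and logged-in members or moderators could be exempted before the counter is consulted.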
Of course if the bot activity is slowing down your site to a bad degree, then it is probably time to upgrade your hardware or software. Install a caching system.
Finally, realize that if people are ripping off your content, you can fight back. Send DMCA requests out to their hosting company if they’re in the US and to Google. It works. If they’re some piddly website hosted in India, just try to ignore them. There is no way they’ll ever accumulate enough links to rank well for anything.
More or less every site I’ve ever published has been ripped off and republished at least one time. It hasn’t driven me to cloak.
The fact is, no matter what Brett says is his reasoning, WMW benefits from their cloaking by increasing user registration rates while gaming the SERPs. It’s wrong that Google should allow it, and I think Matt’s recent investigations were overdue.
Chris Said,
March 4, 2007 @ 7:15 am
Oh Ya, Doug, your examples present poor arguments. Private forums aren’t cloaking of course, no one has said that or would say that. The issue isn’t if a site requires registration or subscription, but rather if it tricks visitors into visiting by showing Googlebot special content different from what a first time user sees.
The solution for this is not a special icon or message in the SERPs, but that if you want your site to be registration or subscription only, you need to accept the fact that your content is not going to be indexed. You cannot have your cake and eat it too.
John Hunter Said,
March 4, 2007 @ 7:34 am
Good. This is a problem that has bothered me for a long time. Here is one post from 2005 I made: http://management.curiouscatblog.net/2005/01/30/web-search-improvements/
I would think Google could come up with a user participation tool that could help identify this type of behavior. Just give people the ability to click something saying the search link didn’t return the expected results or something (I am sure Google can work on the details). I am sure Google can figure out how to separate, from that feedback, the problematic web sites from all the mistaken, malicious… clicks. Then those few sites, site sections and pages that systemically appear could be forwarded for human examination. This really doesn’t seem like it would be very tricky.
I suppose Google might worry about the confusion such an attempt might cause regular users. Ok, just make it an option. Then users who, like me, use Google a lot and get annoyed at having to ignore those sites I have learned won’t provide the content Google says they will can choose to participate. Yes I understand this will create a self selected population identifying problems but I really don’t see that as a problem.
Another option I mentioned in 2005 is to give me a way of adding a list of websites I don’t want returned. Actually now I can just use a Google CSE to do this. In fact I just did - http://www.google.com/coop/cse?cx=002278424765197393586%3Alfichkez90e&hl=en I excluded webmasterworld.com to test it out and it does work. Less than 2 minutes to do. This doesn’t do anything for those who think it is unfair for some to have Google return cloaked pages as results, but it does help if you just want to exclude certain sites from your results.
Brett Tabke Said,
March 4, 2007 @ 8:29 am
> Firstly many scrapers do not spoof their user agent,
> so your first step is banning all said user agents
Done. Aside from the standard 20-25 bot names, the following are required to log in.
java , snoop, lwp, php, boitho.com, spiderman, downloader, Missigua, HTTrack, Fetch API Request, webstripper, Jakarta, IEAutoDiscovery, Zehuti, bot.mainseek.net, TMCrawler, megite, Bottino, QihooBot, Hatena, Missigua, Bookdog, NewsGator, API Request, Macan, Python, webmon, yoono, Boston, Jakarta
That entire lot is rarely seen and cause little problem and we are probably going to delete it as totally ineffective any more. Less than 1 in 500 bots use actual spider agent names.
> Secondly many do spoof their useragent but do so in
> an unstandard way (an IE useragent in a format no
> actual IE installation uses for instance).
There are over 200 known legit variations on ie names and no known way of determining which is a bot and which is not. We do account for some that are known.
> Thirdly once you do detect bad behavior,
> ban the IP (like with a brute force SSH login firewall).
IP banning is of limited use in a dynamic ip world. This is especially true where many top webmasters live in high density places like San Jose or Mountain View. One of the biggest problem areas has been cable modem users in the Valley. It is a hot bed of search engine activity and not unusual for us to ban an ip and get a complaint letter from a se worker the same day. eg: the ip was recycled by the isp. This exact scenario happened with AdWords Advisor who had the same IP on the exact same day we banned the ip for a 20k an hour bot running. You simply can’t go wholesale banning of ips without sooner or later having a problem.
> Banning entire abusive countries or IP blocks
> that are used by server companies and not IPs
> is another good step as well.
Not really. We did that in dec and again last month with .de. That in turn led to some blog stories turning up from Philipp. You can’t mass ban entire countries in a vacuum. You can require them to support cookies - if not login.
> No actual user is going to request 1000+ pages a day
We have more than 500 users that visit more than 500 pages a day - We have over 300 users that visit more than 1000 pages a day. We have another 50 that visit more than 2k a day, several moderators hit 3-5k a day, and it is not unusual for me or admins to have over 10k requests a day. There is no way to determine which is a bot and which is not.
> If it doesn’t match, ban them.
Nope - you would ban half of rr.com if you did that.
> Of course if the bot activity is slowing down your site to a bad degree
If left unchecked, it would be 10 to 1 bots to humans. The site is there for the humans - not the bots.
> Send DMCA requests out to their hosting
> company if they’re in the US and to Google.
It would take a full time legal person to do that right. How are you going to afford that on a subscription basis? The only thing that would do is get you a whole lot of negative press and blog stories.
Multi-Worded Adam Said,
March 4, 2007 @ 8:32 am
I may be the only one who thinks this, but I really don’t think the whole WMW thing is that big of a problem, especially with what’s coming down the pipe.
*** STRICTLY SPECULATIVE COMMENTARY TO FOLLOW HERE ***
(Translation: don’t crawl up my ass if I’m wrong. This is just me predicting stuff.)
With personalization of results, it’s not inconceivable to assume that users will be able to tailor the results such that certain domains/sites do (or more importantly in this case, don’t) show up. You don’t want to see WMW at the top of a SERP for something? Filter out the domain. Since most people who would be looking for WMW-related stuff are web/SEO/tech-geeks anyway, this won’t be that difficult.
Besides that, most of us who have even remotely followed the WMW situation are aware that we will be required to register for the site. I can’t speak for anyone else, but if I see WMW on the first page of some ultra-geeky SERP I created, the red flag goes up in my head and I don’t click.
Don’t get me wrong: I won’t use anything that requires registration before viewing even the most basic of content. But it’s their site to do as they wish with, and they’re not doing anything wrong from a strictly SEO standpoint as far as I’m concerned. It’s a bad user experience, yes, but big G didn’t create it…WMW did. Let them fix the problem if they want to, or leave it alone if they choose to.
As far as the inconsistency of results goes (with some content showing and some not), I haven’t seen that personally, but if it is going on, again, that’s WMW’s problem.
We can all still vote with our keyboards and mice and go somewhere else, so why are we even talking about this? Let’s move on to a real issue, like unoriginal-content MFA sites or something.
Chris Said,
March 4, 2007 @ 8:55 am
Brett, if you really have legit people viewing the site 1000+ times per day, and I really doubt it (strictly speaking, there are only 24 hours in a day, to read 1000 pages of content would be difficult even if it were your fulltime job), you can always increase the limit: 5000, 10000. It would be a simple matter to exclude registered/logged in people and moderators just in case you did have a user who viewed so many pages. With millions of posts even at a limit as high as 10,000 they’d still only get a fraction of your site.
Also, I didn’t mean you should ban western countries. I get tons of spam from .nl and .de but I don’t ban them. But Russia, Korea, Ukraine, China, Nigeria, etc. Definitely. You will of course undoubtedly end up banning some real users, but until these countries get real about policing their Internet usage it’s a valid tactic to combat not only site ripping but all forms of spam and hacking attempts.
Of course the main reason I doubt you’re honestly doing this for the bots is the fact that you don’t allow Google to cache your content. If you allowed them to cache it then, when presented with a registration box instead of the content alluded to in the SERPs, the user could simply view the cache. You don’t allow this, and it can’t be because of bots because…
1. Viewing Google cache causes no resource drain on your server.
2. If a person is smart enough to crack Google’s cache id algorithm, and to figure out a way to get a total list of all your urls to query the Google cache with, then they’re smart enough to write a user-agent-bot that can handle cookies and so both login and rip your site. Assuming Google doesn’t have any backend security to prevent cache abuse (and I’m sure they do).
You can try to justify the cloaking with a bot excuse, but how do you justify the nocache then? I think it’s obvious that that method is without a doubt aimed at users, and not bots.
Brett Tabke Said,
March 4, 2007 @ 9:38 am
Chris, I am all for finding solutions to the issue, and appreciate the discourse.
> It would be a simple matter to exclude
> registered/logged in people and moderators
Which is what we do Chris. Those aren’t an issue. The issue is the one-offs that come in during a Google update and view a thread 20 times a minute for an hour waiting for a new message to pop up to see what other people said. How do you account for that? The only way we can is by requiring login from some that look like bots. That in turn leads us back to the same issue all over again. I am not going to deny them content just because they are over eager - that is exactly who the site is there for and remaining available is more important than anything.
Say you view 10 page views a minute for 10 minutes. (very common), and an admin sees it and requires your ip to login. How long do you leave that on? Do you do it for a day - a week - a month - a year - or permanently? We’ve been doing it close to perm for the last year. It is hard to go back and keep track of who were the most troubled ips.
I think people don’t appreciate or forget who our typical member is that visits the site. Webmasters are tech savvy, offline browser knowledgeable, independent, fearless, and a touch of an attitude. Additionally, we are the only major forum with flat easily crawlable structure that can be downloaded by every offline browser listed on Tucows. Very few of those same offline browsers can download a forum with cgi based urls.
> Viewing Google cache causes no resource drain on your server.
It is a liability that can’t be exposed. DMCA actions, take down notices, copied in content, flames or trouble posts, and mass wholesale site theft out of the cache - is trouble waiting to happen and easily avoided without exposing your content to that.
> cookies
I have tested that and am reconsidering just pure cookies required - which would most certainly lead us back to the same issue of “what is a bot” or “What is a human”. If you did do a pure cookies require setup, you’d absolutely have to pure ip cloak for the engines, or you’d ban them too. So it is far from perfect.
Yes, there are numerous bots running in the forums that support cookies. It is dog easy to write a bot that supports cookies. However, when you require a validated login, it raises the bar and lowers the rate of abuse. You can also swap out cookies from time to time and rerequire a login at a threshold. Which we also do.
> But Russia, Korea, Ukraine, China, Nigeria, etc. Definitely.
We can’t do that on a server level, because we have legitimate members - many subscribers - from those countries. That’s who the site is there for.
What if we started to publish the log entries, the IP’s of the bots, and also the letters of complaint to the ISP’s? See here: http://www.webmasterworld.com/search_engine_spiders/3233952.htm
Also, see this article by Google’s Vice President & Chief Internet Evangelist Vint Cerf, “Botnet ‘pandemic’ threatens internet”
http://news.bbc.co.uk/1/hi/business/6298641.stm
Doug Heil Said,
March 4, 2007 @ 9:40 am
Chris; It’s not cloaking. You are using your definition. I’ll use mine. Mine is certainly more clear and certainly more user friendly. Using yours, it would also mean that if I detect users without flash installed, which also includes googlebot, and send them to the all html page, then I’m also cloaking, right?
I do agree with you that maybe people who have subscription based content should not expect to be in the SERPS. Maybe they shouldn’t expect it, but not allowing it in is hurting the user experience as well as you are not giving the user the “option” to click and register. Having some option is better than no option at all.
There are other things Brett is doing which makes this a difficult problem. There’s more to it than keeping out bad bots. As Adam said, it “is” his problem. I agree that it’s really not Google’s problem. It’s only their problem in that they should put in a disclaimer for the page in the serps. Making it Google’s problem says you are saying that Brett is spamming Google, which he clearly is not.
Doug Heil Said,
March 4, 2007 @ 9:43 am
Where is Phil Craven when I need him?
IncrediBILL Said,
March 4, 2007 @ 10:11 am
Matt,
I really don’t understand why you’re giving any credibility to Philipp Lenssen’s half-cocked opinion about WebmasterWorld cloaking when he obviously doesn’t know what he’s talking about. Philipp should be careful about calling someone a cloaker, and I’m surprised you even jumped on this, when in fact they are NOT CLOAKING; being labeled as such is just as libelous as calling someone a spammer when they aren’t.
I’ve seen what Brett’s doing and it’s not cloaking whatsoever. It’s called SECURITY and he’s setting some BAD NEIGHBORHOODS on the internet that are involved with scraping. This requires a LOGIN to access his site for some ‘net neighborhoods while others are wide open and see WMW content without limitation.
FWIW, if I’m not mistaken there are some major sites that allow Google to freely crawl their premium content but require a FREE registration to access that same content.
How is this cloaking?
Heck, I put a CAPTCHA up on first access from some REALLY BAD NEIGHBORHOODS as well just because they’re a mix of humans and bots it’s the only way to let humans in yet stop the bots dead in the water.
I don’t expect WebmasterWorld to change their behavior until they can effectively firewall the bots or Google can stop indexing scraped content on Made For AdSense sites, which incentivizes them to attack WMW in the first place.
Overall, I’m really surprised about your attitude about what kind of access control security a web site employs, which is NOT CLOAKING, and that you would run off and post this using such a half-baked misinformed article as Philipp’s as evidence.
flash freelancer Said,
March 4, 2007 @ 10:16 am
I usually use the swf object javascript source to embed my swf files into the htm document.
I serve the users that don’t have the correct flashplayer (like the googlebot) installed a text alternative, is this considered cloaking?
Doug Heil Said,
March 4, 2007 @ 10:21 am
Now tell us how you really feel Bill…..
So you see Chris; many actually have the correct definition of cloaking. BTW: Phil Craven shares the same definition, and many others. The reason? It’s a clear definition.
Chris Said,
March 4, 2007 @ 10:34 am
Cloaking is detecting SE bots via IP, UA, or both and serving them different content than everyone else.
Detecting users who do not have flash installed and redirecting them is NOT cloaking because it doesn’t matter if you’re a search engine or not. I could turn off flash in my browser and see the exact same thing Google sees. That is the difference. Of course a better method is to have the normal textual version be the default version and forward/popup a window for people with flash, the opposite method.
What Brett was doing was detecting based on IP and User Agent and serving specific search engines different content based on those systems.
It wasn’t a case of a “bad neighborhood” like IncrediBill is saying, it was everyone who wasn’t a search engine.
Brett, obviously, has made some changes and will no doubt continue to make changes, and I’m not saying he’s cloaking now after the changes, but he was cloaking before. He might not be doing it anymore, but sometime after Nov of 2005 when he said he banned all bots but “wasn’t worried” about losing SERP placements he reversed course and started cloaking.
It is a good thing that Google has finally stopped what many viewed as preferential treatment.
And like I said above, WMW is not the largest nor the most popular webmaster forum, let alone forum in general. There are thousands of larger and more popular content sites and I don’t know of a single one that needs, or needed to, cloak to prevent site rippers.
Chris Said,
March 4, 2007 @ 10:42 am
Sorry Doug, your definition of cloaking just isn’t correct.
When it comes right down to it? Who defines cloaking? Not you, not me, probably the search engines… and I think the search engines have spoken.
The fact is, does it really matter what your definition is if no one uses it for the basis of whether or not a site gets banned or penalized? You can make up any definition you want, but unless it’s actually used by the people who matter (ie Matt Cutts & Google) then what use is it?
I don’t get why you’re defending a blackhat technique to begin with. Even if you buy Brett’s weak bot argument, there is no altruistic justification for disallowing Google’s cache.
If you’re serious about long term SEO do you really want to be supporting a pushing of the envelope in regards to black hat techniques? “It’s okay to spam if you’re doing it for the right reasons.” Is that the example people should see?
I don’t think so.
IncrediBILL Said,
March 4, 2007 @ 10:46 am
>> Viewing Google cache causes no resource drain on your server
Chris,
My site is under attack much like Brett’s and I evangelized disabling cache in all search engines at both SES ‘06 in San Jose and Chicago and at PubCon ‘06.
One of the main reasons I advocate NOARCHIVE is because Google’s cache is also a scraping target and indexed by MFA (Made For AdSense) sites which then fight for your own position with your own content in Google.
Additionally, scrapers can bypass spider traps and honey pots installed on your website by avoiding robots.txt by acquiring lists of links from the search engine cache, such as grabbing a copy of your SITEMAP and other things.
So it *IS* because of ‘bots why some of us disable CACHE and using LOGINS or CAPTCHAS to control security from bad ‘net neighborhoods has nothing to do with cloaking.
Doug Heil Said,
March 4, 2007 @ 11:04 am
>>The fact is, does it really matter what your definition is if no one uses it for the basis of whether or not a site gets banned or penalized? You can make up any definition you want, but unless it’s actually used by the people who matter (ie Matt Cutts & Google) then what use is it?
Huh? Where did I say anything of the sort?? I did not say that cloaking is whether or not a site gets banned or penalized. LOL
What I said is that cloaking is “very” specific. It’s detecting a bot for the SOLE reason to rank in the serps, and sending that bot to a page other than Sally would see. Brett is not doing that.
>>Cloaking is detecting SE bots via IP, UA, or both and serving them different content than everyone else.
Well yes, but in Brett’s case there are real reasons for doing that. He is also doing many more things than that as he does allow users already subscribed to view the threads. He is simply allowing Google to view the threads as well. It’s not like Google does not know this so it is not deceiving Google whatsoever. That is not the definition of cloaking and never has been.
I mean, let’s get real here; You know I’d be the very first to say … WMW is cloaking if that were indeed the case. Cloaking is “always” search engine spam. WMW is not cloaking. WMW is not doing what they are doing because they want to deceive Google, and not doing what they are doing to gain some kind of serp advantage. I think we know those threads would be up there in the SERPs no matter what was going on. sheesh Brett is no dummy.
>>If you’re serious about long term SEO do you really want to be supporting a pushing of the envelope in regards to black hat techniques?
I think we all know how serious I am about this industry. I also think we all know I do not support blackhat techniques…. well Duh?? LOL
We will have to agree to disagree about the definition.
IncrediBILL Said,
March 4, 2007 @ 11:20 am
As a matter of fact, us fat boys have long memories, just like elephants, and remember things that Matt posted in March 2006 about needing to sign up in WebmasterWorld:
http://www.mattcutts.com/blog/how-to-sign-up-for-webmasterworld/
So Matt, would you care to explain how in exactly 362 days you went from telling people HOW to sign up for WebmasterWorld as you knew they were getting a login page to now claiming they’re CLOAKING based on Philipp Lenssen’s link-bait posts?
Suddenly this entire thread isn’t passing the sniff test.
Doug Heil Said,
March 4, 2007 @ 11:27 am
Bill; Matt didn’t say WMW was cloaking. Or did I read his post correctly?
Where I disagree with Matt in his post is that he stated that he is finding that people are cloaking for reasons other than se spam. Cloaking is spam. If it’s not spam, then it’s not cloaking and simply another form of content delivery.
Chris Said,
March 4, 2007 @ 11:28 am
>>Huh? Where did I say anything of the sort?? I did not say that cloaking is whether or not a site gets banned or penalized. LOL
Where did I say it was? I said the definition we should all be using is the definition used by those who have the power to ban or penalize, IE the only people who matter.
>>What I said is that cloaking is “very” specific. It’s detecting a bot for the SOLE reason to rank in the serps, and sending that bot to a page other than Sally would see. Brett is not doing that.
Yes, he is. He banned all bots in Nov of 2005. You probably remember that, it was discussed in many places. He said at the time he wasn’t worried about losing SERP positions. Then he apparently changed his mind, unbanned bots, and started cloaking instead to regain the SERP positions he lost. That exactly fits your definition.
But you emphasized the word SOLE, so you must think it’s okay to cloak if you’re trying to get to the top of the SERPs, so long as you’re already trying to do something else with it? Seems a strange definition to me.
>>I also think we all know I do not support blackhat techniques…. well Duh?? LOL
I think you’d like to think that, but I also think you’re merely redefining “blackhat” to suit your opinion.
>> It’s not like Google does not know this so it is not deceiving Google whatsoever. That is not the definition of cloaking and never has been.
Of course Google knows, hence this post, hence Matt Cutts saying WMW could end up banned if they didn’t change their system. I wouldn’t use the fact that “Google knows” as support for your argument when Matt is saying that it is against their guidelines and will result in punitive action.
>>One of the main reasons I advocate NOARCHIVE is because Google’s cache is also a scraping target and indexed by MFA (Made For AdSense) sites which then fight for your own position with your own content in Google.
You mind showing me an example? Just any example of a remotely popular keyword phrase in which a scraper site outranks a legit site with the exact same content?
Not that it isn’t possible, considering many scraping spammers are better at fundamental SEO than the sites they steal from, but Google has gotten a lot better at fighting that type of spam, it also takes only a minute to fill out a DMCA request & a Google spam report in the rare case that a spammer does outrank you by stealing your content. Open up your word template, copy and paste a few URLs, print it, sign it, fax it.
Granted Google does have to do a lot more, specifically hurt the spammers in the wallet by not allowing such publishers into Adsense. But I think you’re blowing the scraper problem out of proportion. Most scraper sites are crappy, ugly, banned shortly after launch, inspire users to click the back button, and gain no quality incoming links. If you’re having a problem beating them in SERPs on keyphrases you’re
This all being said, IMO a captcha would be alright. Require people to do a captcha or human check of some sort, and give SEs a free pass. There is no benefit to the webmaster, unlike requiring registration, which gives the webmaster another user, lead, and email to market to.
JLH Said,
March 4, 2007 @ 11:30 am
The reasons are irrelevant, some are for good some are for bad. What you call it doesn’t matter at all.
The text on the snippet shown in Google’s results should be viewable on the page they are delivered to, that’s it. If the page doesn’t do that then it shouldn’t be indexed, much less ranked in the results page.
I think the real lesson here is that there is no automatically detectable way for Google to find cloaking, and it takes 6 months of blog posts on a fairly popular blog to inspire at least a manual review. So for all of you wanting to cloak/redirect/etc., don’t worry unless someone with a lot of visibility blogs about it, and then you still have at least half a year to build your mailing list and make the hard sell.
Doug Heil Said,
March 4, 2007 @ 11:34 am
I wish I could find the thread with PhilC talking about this same thing.
We do agree on the disclaimer though. That’s a good thing so let’s build on that.
Chris Said,
March 4, 2007 @ 11:43 am
The problem with a meta tag driven disclaimer or something else is that it’d be difficult to impossible to police. I don’t think Google wants to manually review every site in their index.
Undoubtedly, despite what some people think, Google does have automated cloaking detection systems. If you implemented a system whereby a webmaster indicated they were cloaking for such-and-such okay reason and Google should display a notice in the SERPs, then all a spammer would have to do is issue such a notice as well and be immune to any automated detecting system. In short, it requires complete trust in the honesty of webmasters, and you just can’t trust people like that (see AltaVista circa 1997). This would result in more spam that’d need to be manually reviewed and removed.
The other option is to have Google manually assign such indicators, but again, this is a manual review system, something I doubt Google would go for, and all the spammer would have to do is wait until after his review to start his nefarious activities.
The best solution is for webmasters to think of better ways than cloaking to solve their problems.
J de Silva Said,
March 4, 2007 @ 11:45 am
Thank you for finally addressing this issue about “cloaking” and WMW.
I have read Brett’s comments on the subject and I can understand why he thinks it’s necessary.
I propose we look at the real issue: the very (VERY) real issue of rogue bots harvesting any and every content found on the WWW.
I think Google should not consider this cloaking. Things have changed so much in the last couple of years that it is now normal to have every imaginable bot on your web site at any given time. It’s mad. Some of these bots are so badly written, they aggressively fetch up to 5 pages a second, for hours! Then there’s the “referrer-spam” bots that are simply out of this world!
I think Google should re-consider lumping this together as cloaking. Maybe Google should come up with an acceptable guideline for webmasters who wish to block or prevent such abuses.
Maybe re-directing to a login page is not acceptable. Maybe what is acceptable is a re-direct to a page explaining that the IP is banned for repeated abuse, used by a rogue bot in the past, etc., and how a regular user may view the content (i.e. requires registration or login), plus a link to contact the webmaster for assistance.
I must admit honestly, that I have had to implement something similar recently to limit the abuse on my web site. However the algorithm I wrote also automatically white-lists traffic from banned IPs to allow them access to the site if they are found to be real users.
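To make the idea in the paragraph above concrete, here is a minimal Python sketch of rate-based banning with automatic white-listing. It is not the commenter's actual algorithm, which is not shown in the thread. The class name `AbuseGuard`, the threshold of 5 hits per second (borrowed from the "5 pages a second" figure above), and the `mark_human` hook are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class AbuseGuard:
    """Ban IPs that fetch too fast; lift the ban once a visitor proves to be human.

    Hypothetical sketch: thresholds and method names are illustrative only.
    """

    def __init__(self, max_hits=5, window_seconds=1.0):
        self.max_hits = max_hits
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.banned = set()
        self.whitelisted = set()

    def allow(self, ip, now):
        """Return True if the request arriving at time `now` (seconds) should be served."""
        if ip in self.whitelisted:
            return True
        if ip in self.banned:
            return False
        recent = self.hits[ip]
        recent.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) > self.max_hits:
            self.banned.add(ip)
            return False
        return True

    def mark_human(self, ip):
        """Call after a successful human check (for example a login or captcha)."""
        self.banned.discard(ip)
        self.whitelisted.add(ip)
        self.hits.pop(ip, None)
```

In this sketch a banned IP stays blocked until something calls `mark_human`, which is one way to read "white-lists traffic from banned IPs ... if they are found to be real users". How a real site decides that a visitor is human is left open here.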
IncrediBILL Said,
March 4, 2007 @ 12:07 pm
Doug, the whole article is mostly about WMW and their possibly cloaking and consideration for being dumped from Google for simple site security options. Just because you can access WMW freely from one location and not others does NOT make it cloaking nor does it justify WMW being the center piece of a cloaking article that implies they were being subject to being removed for protecting their intellectual property and being able to maintain adequate bandwidth and server performance for all members, many free and some paid subscriptions.
Whether WMW was actually dumped from Google now isn’t the issue as it’s obvious they’re under scrutiny. My problem is with what happens the NEXT TIME Matt clicks a link and either isn’t a) logged into WMW or b) is trying to access WMW from a bad ‘net neighborhood.
Does WMW get the boot from such a simple misunderstanding?
Will I get the boot for doing almost the same identical thing with my site?
How does Google have the right to tell webmasters not to protect their content from the very sites stealing and using it against us?
Whether scrapers take ONE PAGE, the one indexed and displayed in Google SERPs, TEN PAGES, or a THOUSAND PAGES, the damage can be just as bad regardless of volume. With the imposed rule from Google that you must be able to see at least the ONE page indexed from the SERP, then using a slew of anon proxy servers a scraper can theoretically rip off an entire site, since Google has now forced the front door to be opened in bad neighborhoods because of misguided whining.
What happens when the FIRST PAGE is freely accessible but the same person tries to access the SECOND page from WMW being shown in yet another Google SERP and gets met with a login request, are they cloaking now if it’s a one page limit for free?
OK, so what happens next?
Does WMW have to allow everyone free access to all pages as long as there’s a referral from Google’s SERPS when accessing the page?
If you said Yes, then it’s a WRONG ANSWER because then you’re back to being full-on scraped and attacked as the scrapers can simply stuff faked Google referrers in their requests and it’s off to the races to stop them once again.
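A small Python sketch of why a referrer check alone cannot tell a SERP click from a scraper: the `Referer` header is supplied by the client, so anyone can set it. The function name `looks_like_google_click` and the sample header values are made up for this example and do not describe how WMW or Google actually work.

```python
from urllib.parse import urlparse


def looks_like_google_click(headers):
    """Naive check that trusts the client-supplied Referer header."""
    referer = headers.get("Referer", "")
    host = urlparse(referer).hostname or ""
    return host == "google.com" or host.endswith(".google.com")


# A real visitor who clicked a search result.
browser_request = {"Referer": "https://www.google.com/search?q=php-based+cms"}

# A scraper can write the very same header by hand.
scraper_request = {
    "Referer": "https://www.google.com/search?q=php-based+cms",
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
}

print(looks_like_google_click(browser_request))  # True
print(looks_like_google_click(scraper_request))  # True
print(looks_like_google_click({}))               # False
```

Both requests pass the check, which is the point being made above: the server sees identical headers and has nothing else to go on.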
Even this old post has many people screaming that WMW is cloaking when it’s nothing but simple site security aimed at certain parts of the internet causing trouble.
http://www.mattcutts.com/blog/how-to-sign-up-for-webmasterworld/
OK, please take some time and narrowly DEFINE CLOAKING!
If WMW did a bait and switch, which was show you one type of content then display another type of content when you access the page, which scrapers and Made For AdSense sites do, I’d agree this is cloaking.
However, WMW simply asks SOME people for a LOGIN, which is not content whatsoever, it’s a SECURITY measure. Accessing the page indexed in Google is still FREE and if they don’t feel like becoming a member then it’s fine for them to walk away and look elsewhere.
When your policy on cloaking potentially interferes with the ability for a service like WMW to actually keep their servers online then you need to review your policy as a LOGIN page or CAPTCHA used to stop abuse is NOT CLOAKING!
To the others that argue you should be able to surf anonymously then go somewhere other than WMW. Trying to tell WMW how to run their business without being able to protect themselves from rogue ‘bots basically is putting WMW up against the wall because if they bow to pressure to remove the login from certain areas the ‘bots can knock them offline.
You can’t surf the site anonymously or otherwise when it’s knocked offline.
Esrun Said,
March 4, 2007 @ 12:11 pm
Interesting debate happening here.
IncrediBILL Said,
March 4, 2007 @ 12:18 pm
Sorry Matt, I was writing hurriedly and wrote the above wrong. Matt didn’t claim WMW was cloaking, Philipp did. However, Matt DID entertain the idea and checked it out with the possibility of removing WMW even though Matt posted a year ago regarding the LOGIN page and this was nothing new.
Chris Said,
I think I’ll just do what I’ve been doing until this is officially labeled as cloaking because it’s NOT, it’s a security access method for bad ‘net hoods. Until you run a site getting upwards of 1M visitors a month you have NO CLUE what’s involved in keeping a server under that type of load online when the bad boys come knocking.
Your cloak is my door man.
Chris Said,
March 4, 2007 @ 12:46 pm
I do run sites with 7 figure uniques a month. And I have for years. I’ve had bots slow down my servers, I’ve had site rippers even crash them. I’ve adapted, improved hardware, banned bad ips, and created caching systems. I’ve never cloaked.
g1smd Said,
March 4, 2007 @ 12:47 pm
I use 3 different ISPs to access the net, depending on where I am at the time.
On two of them, WMW asks me to sign in if I arrive there from a SERP, or from a link on another site. On one ISP it does not. From one of those ISPs that does ask me to sign in, I know that thousands of non web-savvy users have inadvertently let a myriad of trojans and bots into their systems.
Showing a log-in page to users is NOT cloaking. Showing a keyword-filled page to bots and some other content to humans IS cloaking. A log-in page is NOT content, it is a log-in page.
Brett shows to bots the same page of content that a logged-in user sees. That can’t be cloaking. The bot sees the actual content. If you are not logged-in then you are asked to do so if you look like a bot, or come from an ISP that has a history of abuse.
Cloaking would be where a keyword loaded page is shown to the bot, and some other content page would be shown to the human. That isn’t what is happening here.
*** I don’t have a “paid” membership to WMW and have always been bothered by any traces of it in the serps. People also link to it from their blogs, you go there and you can’t even signin unless you pay, lame! ***
You don’t HAVE to pay to join WMW. All but a couple of sub-forums are available in the free membership class.
*** I think most people would call what WMW was doing cloaking. ***
I’ll add my name to the “not cloaking” list.
As for numbers, I know I can easily get through several hundred “pages” in a day on WMW. Logging in, navigating to and then reading the index page, then the forum thread list, 3 pages of a thread, writing a reply, then editing it, and going back to the thread list already accounts for at least 15 “page” views.
IncrediBILL Said,
March 4, 2007 @ 1:13 pm
Let’s examine the Google Webmaster Guidelines and HELP for definitions of cloaking and see if there is any reference to a LOGIN or other kinds of web site security in relationship to cloaking.
Nothing deceptive on WMW as users and search engines see the same content. Some users see the page after a LOGIN, so it’s not fooling the search engine, it’s authentication of a human, no rule breaking here.
OK, anyone fooled here?
Many big sites require a login to access their content, and WMW only requires SOME people to login that are in bad ‘net hoods, so LOGIN and you see what’s in the SERP, no fooling!
Again, there is absolutely no mention that a security method is disallowed, just that you shouldn’t write to FOOL search engines which again is not the case on WMW.
Yet again, there’s nothing being shown to the search engines that isn’t intended for the user to see.
As a matter of fact, nowhere could I find any reference in the Webmaster Help Center on Google regarding sites that employ security measures or login, other than the explanation of how a 401 (Not authorized) error might keep a page from being indexed.
I think this issue needs to be officially addressed on the Webmaster Help Center before anyone using security to protect their servers gets unduly penalized as cloaking when it’s simply not true.
Chris Said,
March 4, 2007 @ 1:15 pm
>>Cloaking would be where a keyword loaded page is shown to the bot, and some other content page would be shown to the human. That isn’t what is happening here.
Uhh… yes it is. The threads shown to the SEs are full of keywords. The login page is not.
It isn’t okay just because user registration is free. People are required to sign up, provide an email address, etc. Do you think there is no value in having more registered users and more emails to market to?
It’s fine if you want to require registration in your forum, but it is not fine to trick users into visiting there. That is the bottom line, it is deceptive. The content represented in the SERPs is not the content you see when you follow the link.
feedthebot Said,
March 4, 2007 @ 1:26 pm
JLH has nailed my concern on the head when he said …
“The reasons are irrelevant, some are for good some are for bad. What you call it doesn’t matter at all.
The text on the snippet shown in Google’s results should be viewable on the page they are delivered to, that’s it. If the page doesn’t do that then it shouldn’t be indexed, much less ranked in the results page.”
While on this forum I am sure many technical needs and theories will come up and be discussed, but what is important to the general user of Google is…
“Do I find what I want when I do a Google search?”
The answer is no when someone using Google finds a page whose snippet seems to be exactly what they want and is then directed to another page, whether a registration page or anything else. Google is not delivering in that case. If a site wants to redirect Google users to a specific page, then I feel the only page Google should index is the page those users are being redirected to anyway.
IncrediBILL Said,
March 4, 2007 @ 1:29 pm
Interstitial authentication of human vs bot is not deceptive and people can go elsewhere if they want anonymity.
Besides, you see anyone holding a gun to their head to fill in that form?
If you want the information you may have no choice but to fill in that form or simply look elsewhere. I find links in blog posts to big media sites all the time that ask for me to login for free to gain access to the article.
The only job of the search engine is to make me aware of the content in the first place. The choice is MINE, not the search engine’s, to decide whether or not to complete the interstitial authentication process or just go away.
g1smd Said,
March 4, 2007 @ 1:40 pm
*** *** Cloaking would be where a keyword loaded page is shown to the bot, and some other content page would be shown to the human. That isn’t what is happening here. *** ***
*** Uhh… yes it is. The threads shown to the SEs are full of keywords. The login page is not. ***
It is interesting that you chose to only quote part of what I said.
Let me get some facts straight.
A log-in page is NOT content.
The bot gets the same content that a logged-in user gets. That is NOT cloaking.
Cloaking would be where a logged-in user saw the thread, but the bot was fed a page stuffed full of keyword lists, a page that a user could NEVER see.
In this case, users CAN see the same page that the bot sees - IF they log in.
–
There are two types of page here:
1. Content that the bot gets, and content that a logged-in human gets. These are identical. There is no cloaking.
2. A log-in page. You get this if you appear to be an untrusted bot, or a user coming from an untrustworthy ISP.
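The two page types above, and the two competing tests for cloaking being argued in this thread, can be laid out in a short Python sketch. This is a hypothetical model of the behaviour as described in these comments, not WMW's actual code. The function `serve` and its three flags are invented for illustration.

```python
THREAD = "thread content"
LOGIN = "log-in page"


def serve(is_search_bot, logged_in, trusted_network):
    """Hypothetical model: who gets the thread and who gets the log-in page."""
    if is_search_bot or logged_in or trusted_network:
        return THREAD
    return LOGIN


bot_view = serve(is_search_bot=True, logged_in=False, trusted_network=False)
member_view = serve(is_search_bot=False, logged_in=True, trusted_network=False)
guest_view = serve(is_search_bot=False, logged_in=False, trusted_network=False)

# Test used in this comment: does a logged-in user get what the bot gets?
same_for_members = bot_view == member_view

# Test used by the other side of the thread: does an anonymous visitor
# clicking through from the SERP get what the bot gets?
same_for_guests = bot_view == guest_view

print(same_for_members)  # True
print(same_for_guests)   # False
```

Under these assumptions the two tests give different answers for the same site, which is why the two sides of the thread keep talking past each other.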
Chris Said,
March 4, 2007 @ 1:58 pm
Well, obviously the people that matter disagree with you. It is clearly against SE guidelines, other sites have gotten banned or penalized for the same thing. Defending it probably isn’t going to work no matter your zeal. Remember, Google’s definition (hint, they matter) says that the content seen when clicking on the link in the SERP should be the same as what Googlebot sees.
I am curious to know… if the subscription wasn’t free, if you had to pay, would it still not be cloaking?
What if you had to sign up for an incentivized affiliate offer to activate your subscription? Still not cloaking?
How many hoops would a user have to jump through to get the content (content that they thought was free and available just by clicking) before you would consider it cloaking? Where are you going to draw the line?
Vic Said,
March 4, 2007 @ 2:17 pm
Matt - you started quite the debate here…
IncrediBILL Said,
March 4, 2007 @ 2:24 pm
What part of my post above did you miss, where there is absolutely NO MENTION of interstitial authentication in reference to cloaking?
Not only is it clearly NOT against the SE guidelines, LOGIN is never mentioned.
jimbeetle Said,
March 4, 2007 @ 2:31 pm
One thing I’m very curious about Matt (and since you started the thread, you opened the door for it), just what is Google’s official stance regarding sites like the Wall Street Journal?
Near as I can figure from what I see, Google apparently thinks it’s perfectly okay to return WSJ content in the SERPs, then have the WSJ redirect folks to a “Hey, you gotta’ pay to see this” page.
Just color me curious
Vic Said,
March 4, 2007 @ 2:32 pm
I went back to a post I made on Search Engine Watch almost 2 years ago on this subject and Google’s “First Click Free” program. The following is a snippet of the program details (I don’t know if the program is still in effect as I am no longer with the company that I was offered the program at and I haven’t heard anything of it since but this is right in line with the WMW situation):
If you offer subscription-based access to your website content, or if users must register to access your content, then search engines cannot access some of your site’s most relevant, valuable content.
Implementing Google’s First Click Free (FCF) for your content allows you to include your premium content in Google’s search index. First Click Free has two main goals:
1. Including highly relevant, premium content to Google’s search index provides a better experience for Google users who may not have known that content existed.
2. Promoting sales of or subscriptions to premium content for Google partners.
To implement FCF, you need to allow all users who find your page using Google search to see the full text of the document that the user found in Google’s search results, even if they have not registered or subscribed to see that content. Thus, the user’s first click to your premium content area is free. However, you can block the user with a login or payment request when he tries to click away from that page to another section of your premium content site.
Thus, FCF is designed to protect your content while allowing for its inclusion in Google’s search index.
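As a rough illustration of the First Click Free rule quoted above, here is a minimal Python sketch. It assumes the site recognises a Google search click by the `Referer` header, which is only one possible mechanism and, as noted elsewhere in this thread, one that can be faked. The function name `fcf_decision` and the return labels are invented for this example and are not part of any Google program documentation.

```python
from urllib.parse import urlparse


def fcf_decision(referer, subscribed):
    """Decide what to show for a premium page under a First Click Free style rule.

    Assumption: a click from Google search is identified by the Referer header.
    """
    if subscribed:
        return "full_text"
    host = urlparse(referer or "").hostname or ""
    from_google = host == "google.com" or host.endswith(".google.com")
    # First click from the search results is free; anything else is gated.
    return "full_text" if from_google else "login_or_payment"


# First click, arriving from a Google results page.
print(fcf_decision("https://www.google.com/search?q=premium+article", False))  # full_text

# Second click, navigating from that page to another premium page on the site.
print(fcf_decision("https://example.com/premium/article-1", False))  # login_or_payment
```

The second call models the "click away from that page to another section" case: the referrer is now the site itself, so the visitor meets the login or payment request.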
g1smd Said,
March 4, 2007 @ 2:32 pm
*** Well, obviously the people that matter disagree with you. ***
LOL.
Too Funny.
You’re still wrong.
Multi-Worded Adam Said,
March 4, 2007 @ 2:45 pm
Now you guys see what you did? You see? You woke up Bill! Looks like he’s been drinkin’ that red Kool-Aid he calls beer and he’s on a sugar high, too.
I still haven’t seen one question answered in all of this, so I’ll reask the question using a different tack:
If any of us were to visit a page from a SERP that did not contain the information we sought, or if we were in some way not 100% satisfied with the site containing the page, we would hit the back button on our browser and try a different page from the same SERP, or maybe a different search engine, or maybe we’d go do something crazy like use the phone book or something. In other words, we’d find another avenue.
Why can’t those of you who are complaining do the same thing for subscription-based content, if you feel so strongly about it?
Look at it this way: if you do, Bill won’t be angry about this issue any more and he’ll be down to 999,999 more things to be legitimately pissed off about.
IncrediBILL Said,
March 4, 2007 @ 3:03 pm
BTW, just thought I’d point out that this is patently false and I have almost 12 months of scraper data to back up my claims which you’re more than welcome to see any time. The most offensive scrapers just cut ‘n paste actual user agents from log files, real user agents.
As a matter of fact, some site downloaders come with IE 6 as the default UA, but thanks for playing.
Tim Linden Said,
March 4, 2007 @ 3:18 pm
Wow Matt. How do you get any work done? I got 1/10th of the way down and realized I have to stop reading all the comments =P
Doug Heil Said,
March 4, 2007 @ 3:32 pm
g1smd wrote:
>>>The bot gets the same content that a logged-in user gets. That is NOT cloaking.
Cloaking would be where a logged-in user saw the thread, but the bot was fed a page stuffed full of keyword lists, a page that a user could NEVER see.
In this case, users CAN see the same page that the bot sees - IF they log in.
Exactly.
Gee Chris; I’m finding that there are many, many in our industry that seem to have the same definition of cloaking as myself. Not only quite a few in this thread, but quite a few others out there who have not written anything yet.
Yes Bill; I didn’t think Matt claimed WMW was cloaking and know the article is kind of claiming as such. I also believe the Google definition of cloaking is exactly my definition as well. They certainly wouldn’t want to “expand” the definition to include all forms of content delivery.
Cloaking is always search engine spam.
The answer to the member’s question about WSJ is that it is “not” cloaking as it’s basically the same thing. They are showing all user agents who are “not” subscribed the sign-in page. They are showing all bots and users who are subscribed the content page. Not cloaking.