Call for Papers: AIRWeb 2007

I’m on the 2007 program committee for AIRWeb, the workshop on Adversarial Information Retrieval on the Web, which will be held on May 8th, 2007, in conjunction with the WWW conference up in Banff. One big change from last year is a labeled test set that people can use for their webspam research. I’ll include the call for papers:

Third International Workshop on
Adversarial Information Retrieval on the Web


Track I of the Web Spam Challenge 2007


15/Feb/2007 : Deadline for research articles
30/Mar/2007 : Deadline for challenge submissions
8/May/2007 : Workshop at the WWW 2007 conference in Banff, Canada


1. AIRWeb’07 Topics
2. Web Spam Challenge
3. Timeline
4. Organizers and Program Committee


Adversarial Information Retrieval addresses tasks such as gathering,
indexing, filtering, retrieving and ranking information from collections
wherein a subset has been manipulated maliciously. On the Web, the
predominant form of such manipulation is “search engine spamming” or
spamdexing, i.e., malicious attempts to influence the outcome of ranking
algorithms, aimed at getting an undeserved high ranking for some items
in the collection.

We solicit both full and short papers on any aspect of adversarial
information retrieval on the Web. Particular areas of interest include,
but are not limited to:

* Link spam
* Content spam
* Cloaking
* Comment spam
* Spam-oriented blogging
* Click fraud detection
* Reverse engineering of ranking algorithms
* Web content filtering
* Advertisement blocking
* Stealth crawling
* Malicious tagging

Proceedings of the workshop will be included in the ACM Digital Library.
Full papers are limited to 8 pages; work-in-progress will be permitted 4
pages.

For more information, see


This year, we are introducing a novel element: a Web Spam Challenge for
testing web spam detection systems. We will be using the WEBSPAM-UK2006
collection for Web Spam Detection

The collection includes a large set of web pages, a web graph, and
human-provided labels for a set of hosts. We will also provide a set of
features extracted from the contents and links in the collection, which
may be used by the participant teams in addition to any automatic
technique they choose to use.

We ask that participants of the Web Spam Challenge submit predictions
(normal/spam) for all unlabeled hosts in the collection. Predictions
will be evaluated and results will be announced at the AIRWeb 2007
workshop.

For more information, see


– 7 February 2007: E-mail intention to submit a workshop paper
(optional, but helpful)
– 15 February 2007: Deadline for workshop paper submissions
– 15 March 2007: Notification of acceptance of workshop papers
– 30 March 2007: Camera-ready copy due
– 30 March 2007: Challenge submissions due
– 8 May 2007: Date of workshop



– Carlos Castillo, Yahoo! Research
– Kumar Chellapilla, Microsoft Live Labs
– Brian D. Davison, Lehigh University

Program Committee

– Einat Amitay, IBM Research
– Andras Benczur, Hungarian Academy of Sciences
– Andrei Broder, Yahoo! Research
– Soumen Chakrabarti, Indian Institute of Technology Bombay
– Paul-Alexandru Chirita, University of Hannover
– Tim Converse, Yahoo!
– Nick Craswell, Microsoft Research
– Matt Cutts, Google
– Ludovic Denoyer, University Paris 6
– Aaron D’Souza, Google
– Dennis Fetterly, Microsoft Research
– Tim Finin, University of Maryland
– Edel Garcia, Mi
– Natalie Glance, Nielsen BuzzMetrics
– Antonio Gulli,
– Zoltan Gyongyi, Stanford University
– Monika Henzinger, Google & Ecole Polytechnique Federale de Lausanne (EPFL)
– Jeremy Hylton, Google
– Ronny Lempel, IBM Research
– Mark Manasse, Microsoft Research
– Gilad Mishne, University of Amsterdam
– Marc Najork, Microsoft Research
– Jan Pedersen, Yahoo!
– Tamas Sarlos, Hungarian Academy of Sciences
– Erik Selberg, Microsoft Search Labs
– Mike Thelwall, University of Wolverhampton
– Andrew Tomkins, Yahoo! Research
– Matt Wells, Gigablast
– Baoning Wu, Lehigh University
– Tao Yang,

34 Responses to Call for Papers: AIRWeb 2007

  1. Matt

    That’s a very interesting workshop indeed, especially the part dealing with cloaking. Maybe you wish to elaborate on whether you are satisfied with Google’s current situation in fighting back against cloaking. The reason I ask is that it’s still there in Google’s SERPs 🙁

    Why not ask for feedback to report cloaking within Sitemaps?

    Wish you a great day and successful week.

  2. A major problem with this workshop’s theme is the usage of the word SPAM as an umbrella term – and the tendency to place all spam on equal footing.

    While many of the topics being addressed are mostly a result of malicious behavior, CLOAKING is not necessarily all malicious or even irrelevant, and what ‘specifically’ is being interpreted as LINK SPAM?

    There needs to be more of a human element in that conference; there seems to not be an interest in discovering what factors drive basically good people to use gray hat techniques 😕

    While it is good to apply technology to automatic protection and cleansing of the Searchable Web, it is also vital that Search Engines make proactive gestures to communicate with SEOs and Webmasters who feel that they have NO other choice but to use certain controversial tactics to get noticed. Not all of them start off with malicious intent.

    To always address voids and malcontentment with technology shows only a band-aid approach to redressing matters.

    😮 h, SearchEnginesWeb

  3. SEW

    “CLOAKING is not necessarily all malicious or even irrelevant”

    The cloaking I’m talking about is:

    You show the user something while you show Googlebot something else. As such, webmasters and site owners have many other choices within ethical SEO than to follow such an unethical cloaking method 😉

  4. Thanks for the data, I’m sure I can find a use for it … (evil harpsichord music and lightning in the background) 🙂

  5. I appreciate the heads up, Matt, but I’m busy that week. You and your little friends will have to stumble along as best you can without help from the pros.

    Have fun!

  6. graywolf

    “Thanks for the data”

    Anything you wish to share with the rest of us, if not here on Matt’s blog, let it be on TW:

    Matt Cutts Pass Secret Google Data to The GrayWolf 🙂

  7. Banff? hmm, something tells me I should stock Cutts while he visits the candy store.

  8. Harith, nothing secret. If you go to the site Matt linked to, they give you a dataset of URLs, some of which have been tagged as spam, others tagged as good, and others tagged as undecided (I think that was the term, but if not, it means exactly the same). Now while I may think is nothing but MFA spam and the NYT is cloaking, Google on the other hand disagrees, and since Google holds the traffic, they get to make the rules. Since they are saying it’s a set to work with, I’m proceeding from the point of “someone looked at it and generally agrees with it”. I’ve skimmed the data but haven’t run any hardcore numbers on it yet.

    That said I do assume there are a couple false positives and false negatives that were hand altered. They should have seeded the set so they can see what you turned up. That’s what I would do if it were my data and I wanted to see how good you were.

  9. Talking of spam, is all that duplicate content?

  10. * Advertisement blocking

    Available as a highly advised plugin for the much-touted Firefox.
    Sure is nice!!

  11. Funniest thing though, if it wasn’t for paid advertisement none of that junk would exist.

  12. What’s Gigablast??

  13. Harith writes: The cloaking I’m talking about is: You show the user something while you show Googlebot something else. As such webmasters and site owners have many other choices within ethical SEO than to follow such unethical cloaking method

    Apologies if this is a naive question. I am new to this game. But let’s suppose I have written a long paper or in-depth magazine article or thesis or something like that. Naturally such a long document is going to touch on a number of different, minute sub-topics. Though the focus of the document will really be on only a single topic, the writer might spend some time filling out background material, histories, related works, and such. From a search engine’s perspective, the topic of the document will appear to be pretty diffuse…which I presume is not a good thing for ranking.

    So typically what humans do, to help other humans understand such a long document is write a synopsis, summary, or abstract. This is a concise statement of what the document is about.

    So what is wrong with showing Googlebot the summary or abstract, and then showing the end-user the full article? Let us presume that you are not employing any other sort of spam techniques, such as keyword stuffing or whatever. Let us presume that you are showing Googlebot the same synopsis or abstract that you are showing to the human. It is just that, when you actually show the page to the human, you don’t just show the abstract, you show the full document. But you do only show Googlebot the abstract.

    What is wrong, then, with this? It is still cloaking, right? And should be banned? But how is it different than getting mad at your Tivo, because it only shows you (and lets you search) the summaries of the shows…even though by the time your Tivo actually shows you the program you see the full show? Cloaking, n’est-ce pas?

  14. Duplicate content isn’t necessarily spam imo. Duplicate content is just stupid, while spam is more of a strategy (if you can call it that without Matt tearing off my head :-D). But that’s just me 😉

  15. Don’t you just love this cloak and dagger stuff? And here’s detective Matt Cutts with his huge magnifying glass smoking a cigar. 🙂

  16. I have to agree with graywolf about NYT. They should not be listed in the search results. Nor should other sites that are obviously cloaking and creating a very bad user experience. Another one that needs to go is Project Muse. If searchers cannot get into these sites, they should not be listed in the search results. Same for ACM.Org. These sites add absolutely no value to Google’s search results for the general, mainstream population.

    That would be the majority of searchers who cannot get into these closed sites.

  17. graywolf

    Thanks for your spirit of sharing. Very interesting observations and analysis indeed!

    I do hope that Matt and the friends at WebSpam Team would pay much attention during 2007 to the problem of cloaking.

  18. If you do a top-5-list-of-things-you-don’t-know-about-me-and-link-bait I will cry, no, baby Jesus will cry.

  19. Wow, I really love to attend workshops in SEO, but sadly most of them are held abroad. I just hope someday workshops will be held in our country, the Philippines.

    Thanks for the information.

    Best regards,

  20. interesting

    Adversarial Information Retrieval – is that a posh word for hacking?

  21. >>>”Adversarial Information Retrieval
    addresses tasks such as gathering, indexing, filtering, retrieving and ranking information from collections wherein a subset has been manipulated maliciously. On the Web, the predominant form of such manipulation is “search engine spamming” or spamdexing, i.e., malicious attempts to influence the outcome of ranking algorithms, aimed at getting an undeserved high ranking for some items in the collection.”

    If you guys are looking for some help with getting rid of spam, instead of calling for ‘free’ papers, maybe you should run an ad at TW… 😉

  22. That seems like a very interesting workshop. The Web Spam Challenge seems like it’d be fun to participate in. Maybe one of these days I’ll try attending one of these workshops.

  23. Well, at least Matt isn’t talking about again!

  24. Sadstate: a tier 2 search engine, and arguably the best tier 2, although it has issues with people who think their crawler is some invasive bot looking for stuff it shouldn’t. is the URL.

    Side note: does anyone else see two people shapes in the hostgraph on this page?

    They’re both toward the top and beside each other. One person’s holding his/her arms out with his/her legs kind of bent at the knees but close together, and beside him/her to the right is a person clutching his/her hands as if he/she just won something.

    (Or am I being too artistic for my own good?)

  25. But Michael, What you describe is not cloaking. It’s showing user agents who are already subscribed, the content therein. Those not subscribed get the “subscribe” page. It’s not cloaking at all. Those sites are simply allowing se’s to be already subscribed. User agents who are also subscribed get the same content. Not cloaking. Not spam.

    Of course none of us want to see those type pages in the SERPS. That is an entirely different subject though.

    The main point as it relates to this thread is that showing user agents who are not subscribed a page that asks them to subscribe is NOT cloaking.

  26. C’mon, you don’t expect me to buy that malarkey, do you? Showing the SE’s one thing so you rank and showing me the user crap is nothing more than a bait and switch scheme to pry money out of my wallet. If I really wanted the info I could log into my library, which has free online databases for anyone with a library card, so it’s the point of the matter, not the actual data.

    The reason they get away with it is because they are the New York Times, period.

  27. Cloaking (which some sites do — they use IP detection to let the spiders in — no subscriptions involved) or baiting and switching, call it what you will. The point is that the search engines are serving irrelevant results that adversely affect the user experience. Since I cannot get into those sites I don’t want to see them in my search results. I should not have to click past them to get to meaningful, useful content.

  28. I certainly agree. And why I say that’s a whole other issue. The fact is, it’s still not “cloaking”. It is deceptive however and NO search engine user I know of wants to see a login page when expecting to see free content in the free organic results. I’m not disagreeing with that at all, and Google should do something about it. It’s not a cloaking issue however. It’s a bait and switch and deception issue.

    Myself and PhilC agree on this issue of what cloaking is and is not. It’s not often over the last ten years him and I actually agree on something, so you know we are not full of it….. this time. 🙂 Some members in this thread are “elders” in this industry, so I think it’s high time we actually agree on the real definition of cloaking. There are many types of content delivery. Cloaking happens to be a very specific type. That type is always search engine spam. When Google detects the region of a user agent, Google ain’t cloaking. When other sites do the same thing, it’s not cloaking. When my site detects whether or not a user agent has flash installed and directs them to the html site or the flash site, that’s not cloaking either. All are forms of content delivery. The only form that is spam “all the time” is cloaking. The other forms “could” be spam if the intent is there, but they all have their best practices place as a type of content delivery. What the NYT is doing,… what the WMW forums does,… and even what I could be doing in my own forums if I wanted to do so, is “not” cloaking. Deceptive to search engine users? Oh yes.

  29. In fact Matt did mention something about IP delivery and Cloaking 😉

    Here is what Matt wrote!

    Here’s the short answer from Google’s perspective:

    IP delivery: delivering results to users based on IP address.
    Cloaking: showing different pages to users than to search engines.

    IP delivery includes things like “users from Britain get sent to the, users from France get sent to the .fr”. This is fine–even Google does this.

    It’s when you do something *special* or out-of-the-ordinary for Googlebot that you start to get in trouble, because that’s cloaking. In the example above, cloaking would be “if a user is from Googlelandia, they get sent to our Google-only optimized text pages.”

    So IP delivery is fine, but don’t do anything special for Googlebot. Just treat it like a typical user visiting the site.

  30. Yes; very good. And since the NYT is not treating Google any more special than a user who is subscribed, it’s fine, and not cloaking.

    I call those type of pages in the SERPS as being “search engine user spam” as they deceive the search engine user. When you deceive the googlebot, it’s called “search engine spam”. Google is not being deceived by the NYT. The NYT is deceiving search engine users.

  31. When Google starts showing people what they’ll see when going to the New York Times, ACM, Project Muse, et al., then I’ll agree they are not cloaking.

    In the meantime, I don’t get my information (certainly not my definitions) from Phil.

    The request for Google to stop showing deceptive results from those sites stands. Personally, I don’t care what anyone calls them. They are misleading, they give the unsuspecting user a bad experience, and they are just plain annoying to those of us who don’t want to have to skim over them in the SERPs.

    Google should make showing closed content optional, not a requirement.

  32. Hi Michael, You are not getting the point. Cloaking is a specific form of content delivery that deceives the “search engines”. Where is Google being deceived?

    It’s the search engine user being deceived, right?

    Cloaking is “always” search engine spam.

    I don’t know how else to explain this. Maybe others can explain this more clearly?

    Both Google and NYT are deceiving the search engine users. If Google is not being deceived, it cannot be cloaking.

  33. Hi Matt
    Since Dr Edel Garcia is included in the Program Committee, could you give us a comment on his SEO methodology, particularly EF ratios and c-index?

    Thank you in advance.

  34. Thank you for submitting this sorted data, it did create a fuss of more than 100 comments lol. A very interesting read indeed.