AIRWeb 2007 papers available

AIRWeb 2007 has announced its accepted papers; Bill Slawski has posted the full list, with links to each paper, over at Search Engine Land. I was on the program committee and helped review papers, but I’m not sure whether I’ll be able to make it to the WWW conference or the AIRWeb workshop. If you’re interested in webspam, these papers are fun to read.

20 Responses to AIRWeb 2007 papers available

  1. By the way, if you run a periodic event like a conference, here’s an SEO tip: make one site for your event series, and then add a subdirectory or subdomain for each year’s event. For example, which of these is a better idea?
    http://www.siggraph.org/s2003/
    or
    http://www.sigir2003.org/

    At first glance, you might think the latter is easier to remember. But someone who only visits your site once or twice a year probably wouldn’t remember the domain anyway. The subdirectory approach is easier to maintain because your pages are all on one site, and you only have to renew that domain (as opposed to renewing domains like sigir2003.org, sigir2004.org, etc.). Finally, if you run a popular event, you run the risk of people trying to get ahead of you and registering the logical name of one of your future events (CuttsCon2008 or whatever). Better instead to put your events under one domain.
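
    If you’ve already registered per-year domains, you can still consolidate: 301-redirect each legacy domain into the matching subdirectory on the main site. Here’s a minimal sketch, assuming Flask; the canonical domain and the digit-based year extraction are hypothetical, purely for illustration.

    ```python
    # Minimal sketch: 301-redirect legacy per-year event domains
    # (e.g. www.example2003.org) into subdirectories of one canonical
    # site. Assumes Flask; all names here are hypothetical.
    from flask import Flask, redirect, request

    app = Flask(__name__)
    CANONICAL = "https://www.example.org"  # hypothetical main event site

    @app.route("/", defaults={"path": ""})
    @app.route("/<path:path>")
    def legacy_redirect(path):
        host = request.host.split(":")[0]               # "www.example2003.org"
        year = "".join(c for c in host if c.isdigit())  # -> "2003"
        target = f"{CANONICAL}/{year}/{path}" if year else f"{CANONICAL}/{path}"
        return redirect(target, code=301)               # permanent redirect
    ```

    Point the DNS for each legacy domain at something like this and old links keep working while everything consolidates under one domain.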

  2. It still boils down to a band-aid, “us against them” strategy…

    This high-tech ping-pong game will go on for years – with adversaries constantly ‘one-upping’ each other…

    The real answer lies in taking a DNA approach: discovering the voids and frustrations that eventually push some ‘good’ webmasters to do these things. Not the bad webmasters intent on harming, but the masses who resort to these tactics.

    It is only fair to address the WHY.

  3. “Using Spam Farm to Boost PageRank” by Ye Du, Yaoyun Shi and Xin Zhao sounds good. A good time to compare all this with previous papers.

    Thanks
    Matt

  4. Any guesses on the webspam detection contest? I would have loved a shot at that, but I don’t really have the horsepower to do any useful analysis.

  5. Thank you, very interesting, especially “Combating Spam in Tagging Systems” – good to see that people are recognising we can’t keep treating them like standard semantics. Haven’t read “Web Spam Detection via Commercial Intent Analysis” yet, but it sounds like a very good idea, and something you can probably implement relatively easily over at Google.

    Out of interest, what journals do papers on webspam normally get submitted to? (not my official field, so I haven’t come across many of them)

  6. But Matt, if you register a domain like sigir.org and then create a folder so the URL is sigir.org/2003/ – what is the problem?

  7. Matt

    Speaking of linkspam.

    Here is a suggestion for your consideration.

    – Disable PageRank display on the Google Toolbar.

    – Sites that want the Google Toolbar to display their PageRank value would have to subscribe and agree to add rel=nofollow to all paid links, in addition to providing a human-readable disclosure that a link/review/article is paid (a rough sketch of the nofollow part follows below).

    Thoughts?
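
    As a rough illustration of the “add rel=nofollow to all paid links” idea, here is a minimal sketch assuming BeautifulSoup; the helper name and the set of paid URLs are hypothetical.

    ```python
    # Minimal sketch: mark known paid links with rel="nofollow".
    # Assumes beautifulsoup4; the helper and the paid-URL set are
    # hypothetical illustrations, not any real Google mechanism.
    from bs4 import BeautifulSoup

    def nofollow_paid_links(html, paid_urls):
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            if a["href"] in paid_urls:
                rel = set(a.get("rel") or [])  # bs4 exposes rel as a list
                rel.add("nofollow")
                a["rel"] = sorted(rel)
        return str(soup)

    page = '<p><a href="http://example.com/shop">sponsor</a></p>'
    print(nofollow_paid_links(page, {"http://example.com/shop"}))
    # -> <p><a href="http://example.com/shop" rel="nofollow">sponsor</a></p>
    ```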

  8. Web spam… I don’t know if you remember me, Matt, but almost two years ago, when this blog was new, you saved my website and many thousands of hours of my work by informing me that my site had been reincluded; I was about to abandon it because it had been banned as spam.

    Well, it has happened again: my Google traffic abruptly stopped on Saturday, though my site is still in the index. I do follow your blog, but I don’t have time to keep up on the details of what counts as spam; I work on my site and it gets continually better. Well, it would if it got any traffic. I will abandon it if you think it’s spam, and your searchers will be denied my work.

    Over the past two years I’ve done very well as an AdSense publisher and Google stockholder, but I guess I’ll have to find a new occupation. I would appreciate your giving my site just one more look before letting it die. Thanks.

  9. Andi, I’m not connected with Google, but my advice would be: 1) it is still in the index, so Google thinks it is worth people visiting; i.e., the problem is with how well it is ranking, so it just needs some good old SEO. I would say:

    Cut down on the number of links on a page, and add more (unique) text about each site being linked to. There is a lot of talk about Google trying to keep search results and directories from coming up in its results, as it wants the search to take people directly to a shop. If you are really adding value for the user by giving detailed, informative reviews of sites, then Google’s algorithms will be more likely to return your pages.

    Have you used Google Webmaster Tools?
    http://www.google.com/webmasters
    If there is a problem with your site, it will tell you there. It will also let you see how many times your pages have come up in search results but not been clicked on; you may find that people are simply clicking on your results in the SERPs less than they used to.

    Hope this helps. I doubt Matt can really give site-specific SEO advice, so I thought I would give my two cents.

  10. Thanks for your comments, Tim. What you are suggesting would require a huge effort without any assurance of a return.

    I’m not looking for advice, simply stating my position. If Google thinks my work is spam, I’m outta here. If their policy is a guessing game, I can find better things to do with my time; I’ve travelled that SEO road once before, and it’s just not worth competing with the spammers. It’s not what I do. Many people have used my site and found it worthwhile, but I can serve them better offline if Google thinks it’s spam.

    It would be a shame to flush all that work down the drain, but it’s their choice, not mine.

  11. I should add that I’m not too arrogant to make changes; I am simply fed up with playing silly games and being kept in the dark. The site paid well for a time, but if it’s over, it’s over. I am just asking that Matt give it one last look before pulling the plug.

  12. Andi

    “The site paid well for a time but if it’s over, it’s over.”

    That site has served you well and generated a lot of $$$ for you. Time now to let it rest in peace. God rest its sole.

  13. >>> God rest its sole. (shoe directory too…)

    It is now the best-edited, most comprehensive, most up-to-date women’s apparel directory on the web; I should know, I’ve spent years building it and surveying the competition.

    Women’s apparel is a huge industry, larger than search, and not dominated by one name.

    I don’t need Google to monetize my talents or this database, though its creation was underwritten with AdSense dollars. It has been a sweet ride with Google, and I’m sad to see that end.

    So the loss belongs to the people who search for clothing on Google. When GOOG hits 505, I’ll cash out and travel in Europe for a while, maybe with my database. I’ll be fine, though I do feel sorry for Google’s clothing searchers.

  14. Andi

    “I’ll cash out and travel in Europe for a while, maybe with my database.”

    Good idea. We have very nice weather at present.

    Honestly, that site of yours has spent its life carrying so many links on its shoulders. It couldn’t take it any more; it passed away suffering from Linksfobia 🙂

  15. Harith, your sick obsession with death is disgusting; you haven’t a clue.

  16. Andi

    Ok. No more Linksfobia talk. Here are some serious things you might wish to consider.

    Since the introduction of Google’s new BigDaddy infrastructure, we have seen successive data pushes / data refreshes which can affect some sites’ rankings in a remarkable way.

    Some webmasters have reported that their sites keep moving back and forth in GOOG’s SERPs.
    In addition, we really don’t know for sure whether the friends at the plex have introduced new algos or filters which cause the “instability” of ranking/indexing of some sites.

    Maybe your site will bounce back within a week or so. Good news, right? 🙂

  17. Harith, I have already written off that web site and am reconfiguring the database for a post-Google business. If traffic returns, that will be a gift, but I am moving on. I am one of those being driven from the web entirely by Google’s “instability.” Maybe next decade. Thank you for your interest.

  18. This one is full of misinformation:
    http://www2007.org/workshops/paper_116.pdf

    “One trick that some webmasters play is that they announce they will only accept link exchanges from sites that are topically related, and not entertain link exchange requests from topically unrelated sites. Such sites are sill [sic – evidently they did not spell-check the article] participating in link exchanges, and in some cases, such web sites end up dominating the top search ranks for queries related to the topic”.

    The article implies that sites that exchange links with relevance for the end user are “spamming”. That’s outrageous.

    Nowhere in the article is there a mention of “editorial discretion”, “editor”, or any words synonymous with editorial discretion in making links. It implies that anyone exchanging links for the benefit of the end user is “spamming”.

    Let’s hope that Google engineers are smarter than this. The article itself sends the wrong message to webmasters who are making linking decisions for the end user.

    I’ll wager that the kids who wrote this article have never marketed a new site on the web in their lives.

  19. The AIRWeb papers for 2007 are big on link analysis and related graph theory. I’m not sure that link analysis means much for moderate-traffic business sites any more. For too many sites, most inbound links were created for promotional purposes. It’s possible to detect those. But where would one find “legitimate” inbound links to, say, a local plumber? Inbound links will probably come from directories, blogs, or ads, all of which are spammable. Links thus indicate something significant when they point to useful information, but not when they point to businesses.

    The paper on “Web Spam Detection via Commercial Intent Analysis” is straightforward. Detecting “commercial intent” works just like Bayesian filtering of e-mail spam; you train on some commercial content and some non-commercial content, and build a classifier. As the paper points out, Yahoo has had that up for years. It’s easier than classifying e-mail, since sites that sell something have to look like they’re selling something or nobody will buy.
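
    The “train on commercial and non-commercial content, build a classifier” recipe fits in a few lines. A minimal sketch, assuming scikit-learn; the toy documents and labels are made up for the example.

    ```python
    # Minimal sketch of a Bayesian "commercial intent" classifier, in the
    # spirit of e-mail spam filtering. Assumes scikit-learn; the training
    # data below is purely illustrative.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    docs = [
        "buy cheap shoes free shipping discount sale",   # commercial
        "order now best price guarantee add to cart",    # commercial
        "history of the roman empire and its emperors",  # non-commercial
        "recipe for homemade sourdough bread",           # non-commercial
    ]
    labels = [1, 1, 0, 0]  # 1 = commercial intent, 0 = not

    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(docs, labels)

    print(clf.predict(["limited time offer: buy two, get one free"]))  # -> [1]
    ```

    A real system would train on crawled pages rather than four toy strings, but the pipeline shape is the same.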

  20. A follow-up on my slightly off-topic comments above. I have exited my recent “Google hell,” for good I hope.

    If by automation, congratulations on a superior though flawed algorithm; if by human intervention, thank you for a sagacity surpassing superior bot-love.
