How Cuil generates its categories

This “hairball” post about Cuil isn’t really snarky, so I’ll post it. Cuil is no longer around, but it did spawn a funny post on Reddit about Cuil Theory.

Cuil launched this week. For a search engineer, a new search engine is like a Christmas present: you can’t wait to play with it. Most search engineers can get a good feel for the strengths/weaknesses of a new engine within 10-15 queries. And I’d like to think that with another 5-10 queries, I can usually figure out how I’d spam a search engine. It’s my job to protect Google’s index from spam, so naturally I’m intimately familiar with different webspam techniques. :)

What’s also fun is to figure out how a search engine provides various features. For example, for a Cuil search like [matt cutts] you’ll see the following categories:

Cuil categories

Where do those categories come from? Most people didn’t drill down that far, but it’s quite doable to figure out. If you want, take a few minutes to see if you can puzzle out how the categories are generated before reading on.

Google OS figured it out, for example: “Another interesting idea is the explorative category section that shows related Wikipedia categories and topics.” With a little work, it’s easy to verify that the right-hand box comes from Wikipedia category pages. For example, the string “matt cutts” occurs on the Wikipedia page for search engine optimization, and that page also includes a link to a search engine optimization consultants page. Sure enough, one of the categories listed for [matt cutts] is “Search Engine Optimization Consultants” and the entries under that category are from Wikipedia. Likewise, I think the Wikipedia page for Traffic Power and its link to a category page for black hat SEO probably accounts for why the category “Black_hat_seo” appears for my name.
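To make the guess above concrete, here is a minimal sketch of the kind of lookup that would produce those categories: find pages whose text contains the query string, then surface the categories those pages link to. This is only my reading of the behavior from the outside, not Cuil’s actual implementation. The `PAGES` data, the placeholder page text, and the `guess_categories` function are all hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: page title -> (page text, categories the page links to).
# The text snippets and category lists are illustrative stand-ins, not real Wikipedia data.
PAGES = {
    "Search engine optimization": (
        "... placeholder text that mentions Matt Cutts ...",
        ["Search engine optimization consultants"],
    ),
    "Traffic Power": (
        "... placeholder text that also mentions Matt Cutts ...",
        ["Black hat SEO"],
    ),
    "Web crawler": (
        "... placeholder text with no mention of the query ...",
        ["Internet search algorithms"],
    ),
}


def guess_categories(query, pages):
    """Return the categories of every page whose text contains the query string.

    Categories backed by more matching pages come first. This is an assumed
    heuristic for illustration, not a description of what Cuil really does.
    """
    needle = query.casefold()
    counts = Counter()
    for text, categories in pages.values():
        # Plain case-insensitive substring match: any page that mentions the
        # query contributes all of its categories.
        if needle in text.casefold():
            counts.update(categories)
    return [category for category, _ in counts.most_common()]


if __name__ == "__main__":
    for category in guess_categories("matt cutts", PAGES):
        print(category)
```

Note that a single matching page is enough to contribute a category here, which is why a category can show up even when it is only loosely related to the query.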

There’s nothing wrong with surfacing Wikipedia category pages, of course, but sometimes that can lead to some drift in topicality. For example, p2pnet wrote about a search for their name: “[The search query] p2pnet.net, however, gave Canadian copyright law, Project Gotham Racing Series, file sharing networks, Wired magazine people, and filesharing programs.” You can see the categories for the search [p2pnet.net] below:

Cuil categories for p2pnet

And this Wikipedia page has the string “p2pnet.net” and also has a category page for “Project Gotham Racing series”. The idea of surfacing Wikipedia category pages will have advantages and disadvantages depending on the user and the query.

3 Responses to How Cuil generates its categories

  1. Hey Matt, I must thank you in advance for publishing this comment (if you approve it).
    I have a (maybe stupid) question for you: will you visit Indonesia some day?
    Why do I ask? Because I believe there are many people all over the world (including in Indonesia) who want to see you in real life, and we have so many questions about SEO and things relevant to Google.
    If you intend to spend your holiday in our country, please let me know. I’ll be glad to be your tour guide to the last land of the endangered Javan Rhino: UK National Park.
    With best regards and warm greetings from Indonesia

    Mr. Sharz / admin of Direktori Blog Indonesia

  2. It’s ironic that you are, in theory, the best spammer in the World – and the best preventer of spam!

  3. We deal with SEO through improving web content. Do you feel that the introduction of new search engines will cause a major shift in user search engine preferences, or does it generally fall by the wayside? We are excited to get to know about Cuil more. Thank you for the post.
