How Cuil generates its categories

This “hairball” post about Cuil isn’t really snarky, so I’ll post it. Cuil is no longer around, but it did spawn a funny post on Reddit about Cuil Theory.

Cuil launched this week. For a search engineer, a new search engine is like a Christmas present: you can’t wait to play with it. Most search engineers can get a good feel for the strengths/weaknesses of a new engine within 10-15 queries. And I’d like to think that with another 5-10 queries, I can usually figure out how I’d spam a search engine. It’s my job to protect Google’s index from spam, so naturally I’m intimately familiar with different webspam techniques. :)

What’s also fun is to figure out how a search engine provides various features. For example, for a Cuil search like [matt cutts] you’ll see the following categories:

Cuil categories

Where do those categories come from? Most people didn’t drill down that far, but it’s quite doable to figure out. If you want, take a few minutes to see if you can puzzle out how the categories are generated before reading on.

Google OS figured it out, for example: “Another interesting idea is the explorative category section that shows related Wikipedia categories and topics.” With a little work, it’s easy to verify that the right-hand box comes from Wikipedia category pages. For example, the string “matt cutts” occurs on the Wikipedia page for search engine optimization, and that page also includes a link to a search engine optimization consultants page. Sure enough, one of the categories listed for [matt cutts] is “Search Engine Optimization Consultants” and the entries under that category are from Wikipedia. Likewise, I think the Wikipedia page for Traffic Power and its link to a category page for black hat SEO probably accounts for why the category “Black_hat_seo” appears for my name.
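The lookup described above can be sketched in a few lines of Python. This is only a toy illustration of the guessed mechanism, not Cuil’s actual implementation: the `TOY_PAGES` data, the `guess_categories` function, and the ranking by match count are all assumptions made for the example, and the page text snippets are invented placeholders rather than real Wikipedia content.

```python
from collections import Counter

# Toy stand-in for a Wikipedia dump: page title -> (page text, category names).
# Titles and categories follow the examples in the post; the text is made up.
TOY_PAGES = {
    "Search engine optimization": (
        "... Matt Cutts ... search engines ...",
        ["Search engine optimization consultants"],
    ),
    "Traffic Power": (
        "... Matt Cutts wrote about the company ...",
        ["Black hat SEO"],
    ),
    "Unrelated page": (
        "Nothing relevant here.",
        ["Some other category"],
    ),
}


def guess_categories(query, pages):
    """Collect the categories of every page whose text contains the query.

    Categories are ranked by how many matching pages link to them
    (a hypothetical ranking rule chosen for this sketch).
    """
    needle = query.lower()
    counts = Counter()
    for text, categories in pages.values():
        if needle in text.lower():
            counts.update(categories)
    return [name for name, _ in counts.most_common()]


print(guess_categories("matt cutts", TOY_PAGES))
```

A plain substring match like this also hints at why topical drift happens: any page that merely mentions the query string contributes its categories, whether or not the page is really about the query.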

There’s nothing wrong with surfacing Wikipedia category pages, of course, but sometimes that can lead to some drift in topicality. For example, p2pnet wrote about a search for their name: “[The search query] p2pnet.net, however, gave Canadian copyright law, Project Gotham Racing Series, file sharing networks, Wired magazine people, and filesharing programs.” You can see the categories for the search [p2pnet.net] below:

Cuil categories for p2pnet

And this Wikipedia page has the string “p2pnet.net” and also has a category page for “Project Gotham Racing series”. The idea of surfacing Wikipedia category pages will have advantages and disadvantages depending on the user and the query.

It’s time to stop PROTECT IP

A couple months ago, I wrote this about SOPA:

SOPA galvanized the tech community, from start-ups to venture capitalists to the largest web companies. SOPA was an unexpected shock and a wake-up call. Well, guess what? Now the internet is awake. And I don’t think it’s going back to sleep any time soon. We might need to rally again in the near future, but we can do that. The internet learns fast.

Now it’s time to rally and get loud. It’s time to call your Senators. Heck, it’s time to ask your parents to call their Senators. If you think the internet is something different, something special, then take a few minutes to protect it. Groups that support SOPA have contributed nine times more money in Washington D.C. than our side. We need to drown out that money with the sound of our voices. I’d like to flood every Senator’s phone, email, and office with messages right up until January 24th.

If you need a quick refresher about why the Stop Online Piracy Act (SOPA) and PROTECT IP Act (PIPA) are horrible ideas, Google did a blog post talking about how SOPA and PIPA will censor the web and won’t stop actual pirates. Or read about how capricious takedowns can cause serious collateral damage. Find out how real, legitimate companies can be run out of business.

What can you do?
It’s time for action. Call your Senator right now. Spread the word to your friends and family. Promise not to vote for politicians who support SOPA. Print out some PDFs and post them at work or on your campus. There are also protests and meetups happening today in New York, the Bay Area of California, and Seattle. Don’t live in the United States? You can still petition the State Department at americancensorship.org.

This is it. You want to look back months from now and know that you did everything you could to protect the internet. Call your Senators, educate your friends and family, and please spread the word about PROTECT IP and SOPA as widely as you can.

But if you can only spare five or six minutes, please call both of your senators.


Thank you!

Progress against SOPA

When I did my blog post about the Stop Online Piracy Act (SOPA) last week, things looked quite grim. The fight isn’t over, but there have been a lot of great developments in the last few days. If you’re not familiar with SOPA (and the PROTECT IP Act in the Senate), here’s a video that covers the basics:

This editorial by Rebecca MacKinnon on internet censorship under SOPA also describes why SOPA would be really bad for the internet.

I also wanted to take a minute and thank everyone who called or wrote their Congressperson to speak out against SOPA and PROTECT IP. As a result of people speaking up in the last few days, a lot has happened:

- Republican Representative Darrell Issa and Democratic Representative Nancy Pelosi came out against the bill. Rep. Issa said “I think it’s [SOPA] way too extreme, it infringes on too many areas that our leadership will know is simply too dangerous to do in its current form.”

- On the Senate side, Maria Cantwell, Jerry Moran, and Rand Paul all came out against PROTECT IP.

- The European Parliament passed (by a large majority) a resolution criticizing SOPA. The resolution emphasizes “the need to protect the integrity of the global Internet and freedom of communication by refraining from unilateral measures to revoke IP addresses or domain names.”

- Sandia National Laboratories, a part of the U.S. Department of Energy, concluded that the SOPA legislation would “negatively impact U.S. and global cybersecurity and Internet functionality.” Sandia joins Republican Representative Dan Lungren, who also worried that SOPA would undercut efforts to secure the internet with DNSSEC.

The response from regular people has been just as incredible. Consider:

- Tumblr made it easy for anyone to call their representative, resulting in over 87,000 calls to Congress. If you haven’t called yet, this page on Tumblr makes it easy to call your congressperson.

- A ton of web users now have this issue on their radar. The Hill noted that “at one point on Wednesday four of the top 10 searches on Google were related to the bill. ‘Internet censorship’ was still the second most searched-term as of Thursday evening.”

- SendWrite offered a way to send a physical letter to Congress. SendWrite eventually had to put on the brakes after over 3000 people submitted letters to send.

I think this overreach on SOPA will actually make the internet community much stronger. Let me tell you why.

The forces in favor of SOPA have been outspending the tech industry almost 10 to 1 in Washington, according to a recent article in Politico. Here’s an image from that article that illustrates the vast gulf in spending:

Spending of content industry vs. tech industry

And members of Congress are not always the most tech-savvy: the Congressional Research Service tallies only six engineers in Congress. But if you look further out, the picture is quite different.

In 20-25 years, a generation of “digital natives” who grew up with Facebook/Twitter, search engines, and cell phones will start entering Congress. The digital generation will protect technology like the internet from especially bad regulation. They’ll protect technology because they grew up with it and embrace it. So if we can make it through the next 20-25 years, the people in power will protect technology for us, not fear it.

At least, I thought we’d have to wait 20-25 years before a critical mass of people would defend the net. But SOPA has brought that day a lot closer. SOPA galvanized the tech community, from start-ups to venture capitalists to the largest web companies. SOPA was an unexpected shock and a wake-up call. Well, guess what? Now the internet is awake. And I don’t think it’s going back to sleep any time soon. We might need to rally again in the near future, but we can do that. The internet learns fast.

What can you do?

- Sign up at American Censorship to send a note to Congress and get updates.
- Call your congressperson with Tumblr’s easy web page.
- I believe anyone inside or outside the United States can sign this White House petition. If you’re outside the United States, you can also sign this petition.
- Follow groups like the Electronic Frontier Foundation (EFF) on Twitter.
- Sign up with United Republic, a new organization dedicated to the larger problem of money in politics.
- Sign up to have Senator Ron Wyden read your name on the Senate floor when he filibusters against this legislation.

Which charities do you donate to?

Each year I like to ask what charities people are donating to. There are still a couple of days left in 2010, so I wanted to ask readers about their charity or non-profit giving.

I’ll mention the main organizations on my giving list this year:

  • charity: water brings clean, safe drinking water to people in developing nations.
  • The Poynter Institute is a school that trains journalists and would-be journalists, both in person and online.
  • The Committee to Protect Journalists defends press freedom and the rights of journalists to report the news world-wide without fear of harm.
  • MAPLight.org provides tools and data to investigate the influence of money on politics.
  • The Sunlight Foundation focuses on using technology to make government more transparent and accountable.
  • I don’t think I’ve mentioned my Mom’s charity on my blog before, but I did donate money this year to it, so it seems appropriate to mention it. Blessing Hands provides scholarships and other help to students in China. Side-note: in the same way that I don’t accept gifts or free things, if you ever decide to donate any money to Blessing Hands, please don’t tell me; I wouldn’t want a donation to create the appearance of any conflict of interest with my job.
  • The Electronic Frontier Foundation (EFF) defends everyone’s digital and online rights. The EFF has stopped more bad ideas online than I can even count.

Those were the organizations that I ended up giving some money to. Now it’s your turn. What charities would you like to mention, support, or call out?

By the way, I’d still like to find 501(c)(3) organizations with low overhead costs that support open-source software. And I’d still like to find an organization that teaches the basics of journalism online for free. The training could cover the history of journalism, research and fact checking, ethics, legal principles, rights, how to investigate, libel and slander, off the record vs. on background, and so on. Sort of like The Khan Academy, but teaching journalism. If anyone knows of such organizations or non-profits, please leave a comment!

Letter to a young journalist

Don’t conclude from my previous post that I dislike journalism. All through middle and high school I woke up early to read the local newspaper each morning. I was the editor of our newspaper in high school. My mother wanted me to be a journalist. I’ve been thinking of the issues confronting journalism for a few years now.

Back in early 2007, a journalist friend in the Midwest emailed me. He saw the impact of the web and the changes in the newspaper industry, and he was worried about his newspaper’s–and his own–future. He asked my opinion on all of this. With his permission, here is what I wrote to him back in 2007, with a few minor edits.

Definitely take me with a large grain of salt–I got lucky in joining Google, but I wouldn’t give my opinions any more weight than an average person’s. :) My personal hunch is that newspapers will have some issues in the years to come. If you think about the fraction of revenue that comes from classified ads, it does seem that revenue will eventually migrate online, and sites like craigslist.org are more likely to capture a big fraction of that traffic compared to individual newspapers or even newspaper syndicates. If a newspaper loses ~30% of classified ad revenue over 5-10 years, that’s really hard to adjust to without structural change.

It’s funny because my Dad basically took a job out of grad school and stayed in the same post until he retired. It seems like the odds of that happening for people like you and me are a lot lower. There just aren’t as many companies that are doing things like taking care of their workers for 30 years at a time.

So the first thing I’d recommend is to grab a domain name and work on burnishing your personal reputation online. It’s definitely not the case that everyone needs a blog, but having one place that acts as a face to the world can really help. There’s room for a resume/CV, but also for some writing samples that show off your abilities.

Which takes me to how open-minded [Midwest newspaper] is. I’ve heard newspaper policies ranging from “If you blog, we’ll fire you” to “If you don’t blog, we’ll fire you.” I hope that the paper is pretty open-minded. But they shouldn’t be able to stop you from building up your reputation online in your own time, and even if there are copyright issues with putting full articles up on your personal site, you could no doubt quote a few excerpts of choice stuff as a part of fair use.

So making the switch to a mental model where you are more like a consultant for any company that you work for, but you look for ways to improve your reputation and learn new skills as you go–that’s a good way to make sure that you’re protected if you unexpectedly end up as a free-agent.

You’ve got a good sense of humor and you’re well-spoken, so the biggest questions to me would be:
- what do you love or what are you interested in?
- where do you want to be in 5-10 years?

For example, it would be interesting to know a little bit about your interests. Things like games, gadgets, politics, or technology make great subjects if you want to try some active blogging + something like AdSense to make a little bit of money on the side. But some of the larger issues are things like:
- how introverted you are vs. how much you enjoy talking to people.
- what ties do you have to [Midwest city], and do you want to stay around there or are you willing to move?

I’ve noticed that networking and getting to know a few people in an industry can make a big difference. If the tech-field is interesting to you, [Midwest city] is going to be a more limited pool of opportunities compared to something like Silicon Valley/San Francisco or New York. If you like to travel and like meeting new people, it turns out that becoming an expert in a niche and then getting on a conference speaker circuit can be good. You might start out on panels about journalism or media or ethics, but that could quickly lead to consulting gigs, for example.

I do think that the tech industry will be a leading one for the next 10-20 years, and probably biotechnology will start to emerge after that. But I think the service economy will remain strong throughout. Starting to get on a conference speaking circuit is really a way to rebrand yourself as an expert on some topics. That role would let you expand and offer your services/advice in different ways.

I guess the larger issue is that working for a salary is great, but if you can find ways to participate more directly in the success of a company, that can be a faster way to make money. The whole dot com boom demonstrated that there are a lot of dumb start-ups out there, but at the same time, when you’re young is exactly the right time to take risks like that.

Re-reading your email, I guess the smallest step forward would be to find out what you can legally do online that wouldn’t conflict with your employer’s guidelines. Then I’d just try a few experiments. It doesn’t cost much to buy a domain name, so you might consider starting a blog about [Midwest city] or a news site (probably the blog is a little easier to start). Set aside $100 or so (or ask for someone to register a domain for you as a birthday present) and try a few things. Sign up for AdSense and see what sort of articles do well on places like techmeme.com or reddit.com or digg.com. In general, I’ve found that starting with a small niche and building your way up is great practice and teaches you a ton about what sort of things attract attention and good discussion.

Some of that advice has aged well, some less so. I still think that grabbing a domain and experimenting is invaluable. I believe that the entire world is being digitized–from businesses and places to people–and it’s better to be involved in that process than to stand passively and let other people define you online. I believe that participating in the upside of a company is better than only drawing a salary. And I think that most of the time, no one cares about your career ladder or skills development as much as you do. No good company opposes the development of its employees, but ultimately you have to take the initiative and drive your career in the direction you want.

By the way, the title of this post is an allusion to Letters to a Young Journalist, an excellent book by Samuel G. Freedman.
