PageRank in academia

(Another post in the “back to school” theme.)

I was reading the June 2006 issue of Nature a few weeks ago back in Kentucky, and happened across a good article by Mark Buchanan. He discussed a recent paper in which scientists decided to rank papers not just by the raw number of citations, but by using a PageRank-like algorithm. One important paper by John Slater jumped from 1,853rd to 10th place:

The Slater determinant slipped into common usage and into a number of other papers that went on to become classics. Today, this paper gets few direct references, but scores points indirectly in Google terms as other papers that cited it long ago continue to accrue new citations.
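To make that effect concrete, here is a minimal sketch of a PageRank-style score on a toy citation graph. It is only an illustration, not the method from the paper Buchanan describes: the paper names, the graph, the damping factor of 0.85, and the fixed iteration count are all assumptions made up for this example.

```python
def pagerank(citations, damping=0.85, iterations=100):
    """Plain power-iteration PageRank.

    `citations` maps each paper to the list of papers it cites.
    Papers that cite nothing spread their score evenly over all papers.
    """
    papers = set(citations)
    for cited in citations.values():
        papers.update(cited)
    papers = sorted(papers)
    n = len(papers)

    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        # Every paper gets a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for p in papers:
            cited = citations.get(p, [])
            targets = cited if cited else papers
            share = damping * rank[p] / len(targets)
            for q in targets:
                new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical graph: "old" is cited directly by only two papers,
# but those two went on to collect citations of their own.
citations = {
    "classic_a": ["old"],
    "classic_b": ["old"],
    "new_1": ["classic_a"],
    "new_2": ["classic_a"],
    "new_3": ["classic_a"],
    "new_4": ["classic_b"],
    "new_5": ["classic_b"],
    "new_6": ["classic_b"],
}

# Raw citation counts, for comparison with the PageRank-style score.
direct = {}
for cited in citations.values():
    for q in cited:
        direct[q] = direct.get(q, 0) + 1

rank = pagerank(citations)
for paper in sorted(rank, key=rank.get, reverse=True)[:3]:
    print(paper, direct.get(paper, 0), round(rank[paper], 3))
```

In this toy graph, "old" has fewer direct citations than either classic, yet it ends up with the highest score, because every new citation to a classic passes some credit along to the paper that classic cited.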

I don’t see Buchanan’s article online, but a physics student did his own summary. I like the notion that Google sprang from academia, and that things like Google Scholar can make life a little easier for students in return. Besides, you know, free hosting and bug tracking for open-source projects, Summer of Code, Anita Borg scholarships, our free pizza for late-night hacking ambassadors, the free sitesearch and websearch that we offer to universities and non-profits, and stuff like that. Jeez, that’s a lot of stuff. And I forgot about the page for college students that collects our free services. The n-gram data we’re making available to researchers about phrases on the web. The 2006 Code Jam programming contest. Okay, I’m stopping because my head hurts. But it’s clearly a good time to be a student.

31 Responses to PageRank in academia

  1. Matt, do you think a paper-citation ranking system should take positive/negative citations into consideration? That is, should a paper be deemed important if, 10 years after it’s published, other papers merely refer to it as “that now-debunked theory”?

  2. FYI: Your blog is remembering mixed-up URLs. I just noticed that.

  3. Hmm, I found eight scholarly papers referring to my website. Any brownie points due for the REAL page rank?

    Also rediscovered my Master Thesis from 15 years ago:

    “Low radio frequency chemical vapor deposit of diamond thin films”

    I’d forgotten the title:-)

  4. There’s a code contest going on?
    Just out of curiosity, Why is php not included in the code contest?
    It’s a very popular programming language for the web.

  5. Hey Matt,

    On the theme of pagerank, just how useful / good is the spam report tool at http://www.google.com/contact/spamreport.html.

    A competitor’s site (well, they compete in theme) is and has been for some time using repeated keywords at the bottom of its pages and repeated links to internal sections of their sites on pages (to the point where a huge percentage of pages on the main site are just full of links!), and they also have twenty or more domains of the same theme all cross-linking to one another to encourage pagerank.

    This site is currently #1 in Google for an important keystring, and in fact this site and two of the sister sites take up three of the top ten results in Google for the keystring! I’ve used the Spam reporting tool mentioned above a few times but to no avail… have you any idea on turnaround times for processing these reports?

  6. Matt, I am forever wondering why the Google Ambassador offer hasn’t reached British shores yet. I am sure I’m not alone in wanting to spread the Google love at my university! Maybe you could drop in a sly word with the powers that be 😉

  7. Thank goodness I finished school BG (Before Google). I believe I recall citing the required 99.9% of my sources…

  8. Dean, I will pass that feedback on. 🙂 kumar, I’d guess that’s just because the Code Jam folks have to limit the number of languages. MM, I would guess that would be a second-order effect to include negative citations. Hugh, you might try doing a spam report through the webmaster console (Google webmaster tools). There’s a “Tools” section in the top-right, and we can give those spam reports a little more credibility, because you have to sign in to use that tool.

  9. Michael. What a crazy idea!

    Peer review is all very good, but let’s face it, you can only tell the true value of a document’s worth by counting how many 11 year old bloggers have linked to an article describing how free energy and perpetual motion works 😉

  10. Michael, citation analysis normally does not split citations into positive/negative categories. The conjecture is that citations are made because the citer deems the cited work relevant. All other reasons for citing a work are normally treated as unsystematic and ignored.

  11. Interesting – this may have unintended consequences.

    So will universities/academics try to game Google Scholar to increase the perceived importance of papers coming from their uni?

    I’m thinking of when universities are competing for grants and government money.

    Not sure what happens in the US, but every so often all the UK unis are ranked on how much research they do and how important that research is, and that feeds into their funding.

    Now there’s a niche SEO field to get into 😉

  12. Another attempt at transferring the PageRank algorithm to the scientific citation ranks can be found at http://arxiv.org/abs/cs.DL/0601030

    The authors distinguish between popularity (how often are you cited?) and prestige (by whom are you cited?) and introduce a combination of both (the “Y-factor”).

  13. Dave (Original)

    PageRank! PageRank! Gotta check all data centers now 😉

  14. Come to think of Code Jam, I remember reading a blog post recently about some poor Iranian coder who was unable to enter as the whole of Iran was excluded from the competition. I think it was due to stupid (and discriminatory) US export restrictions preventing this.

    If it were an issue of money being given to them, perhaps people in these countries could be allowed to enter but not be able to claim any prize money? Some might not want to, if the money was the primary objective, but it might go a little bit along the way to proving that people in these countries aren’t all rabid psychotic US-hating maniacs, but human beings.

    Could you pass this suggestion along please Matt?

  15. He is getting good at that outbound linking huh? 🙂

  16. Positive/negative citation-based scoring seems like an interesting concept. Similar methodologies have been implemented in standalone studies to determine weight of authority. If it could be done algorithmically, it would be a way to sift through aggregate opinions. Sort of like panning for gold.

    I wonder if Google and Yahoo! don’t already try the same thing? The NOFOLLOW tag would be one way of identifying untrusted sites, but that could clearly be abused. Negative linking would be more accurately identified through the comments surrounding the link anchor text.

    But should poison pen campaigns, either in academic literature or web indexes, be given that kind of power? There are moral and ethical issues which arise from citation-based scoring regardless of whether you score on the basis of a mono-value or bi-value system.

  17. Hmm, all that may give people the impression that the algorithm favors well-aged papers. lol It does seem rather backward-looking on the face of it, but then so does academia. It doesn’t seem ready to replace good old-fashioned editorial review by an experienced expert in the field who knows which papers are really important, and which just had good link-bait.

  18. To tell you the truth, I’m a little confused about PageRank. Where on the range from 1 to 10 do low, medium, and high PageRank fall?

  19. I think I’m misreading something. Sorry if I am unclear, but Michael Martinez mentioned “poison pen” as a way to determine rank of a document. Does that mean in more simple terms that Google Scholar will be using a ranking mechanism for documents that is not algorithmic? People might be able to negatively or positively affect rankings on documents via some sort of voting mechanism?

  20. Matt, thanks for the reply and feel free to pass on my email for any pilot scheme 🙂

  21. SearchStudent, I was sort of rambling/thinking online. On the one hand, it’s tempting to ask for a sort of negative or inverse PageRank feature. On the other hand, it would be abused, just as standard PageRank has been abused.

    Presently, Google says there is almost nothing anyone can do to harm your search engine rankings. PageRank in itself only represents a probability that someone will visit a page on a random basis through a link. The more links that lead to a page, the more “important” the page seems to be — although in reality those links can represent a broad number of non-respectful citations.

    For example, when I first launched Xena Online Resources as a Web directory in 1997, I reviewed thousands of “Xena fan sites” that consisted of nothing more than the Xena logo and a handful of links to Web sites that were already well known. People were simply linking to sites they had heard about without providing any sort of editorial direction.

    The miserable failure link bombs are another example, where people are obviously being malicious for political reasons. The links don’t in any way reflect a positive sense of value or editorial direction (beyond political malice). Do these people really want those pages to be deemed “important” or “valuable”? I don’t think so.

    Numerous content-scraping SpamAd sites provide dozens, sometimes hundreds of outbound links as content, purely for the sake of generating content to drive Javascript-based ads from Google and/or Yahoo!. These MFA (made-for-advertising) sites clearly don’t reflect any sort of positive citation or editorial choices.

    There are other types of links which, if not filtered out of the equations, also distort PageRank — in terms of representing editorial choice, although the likelihood that someone may click on those links might be similar to the likelihood that someone would click on any links at all.

    So, I was just thinking that a negative or inverse PageRank could at least be used to measure contrary opinion or opposition. Derisive linking does actually represent an editorial point of view (as in the politically motivated “miserable failure” links referred to above). A PageRank based on negative opinion might serve some useful purpose in future search engine indexing. But, again, I think it would be subjected to considerable abuse.

  22. SearchStudent, my “poison pen” remark was a bit of mental meandering. It really has nothing to do with how Google Scholar ranks documents.

  23. Sorry, Matt. For some reason, I didn’t see that my earlier reply to SearchStudent was still here. Only when I posted the second, shorter followup did I see my reply (I’m on a different machine). I’m just having a Michael Martinez Day, I think….

  24. I just don’t see how this could be seen as a positive contribution to the academic world. As Michael Martinez already mentioned in one of his posts, inbound links may represent good or bad feedback for your site (paper). Even with websites, PageRank has shown great strengths, but over time people have become more and more concerned about its weaknesses.

  25. Have done some searches on the net lately that made me realize that while, yes, it’s probably a good time to be a student, on the other hand the available references over the net are a tempting source of information, so that students would not do any work on their own. I mean I hadn’t heard until recently of the site that is in fact the resource to look up backwards references to filter out cheating students, so that teachers and editors can spot if someone turns in a paper that seems to be originating from another author or authors!… Meaning if you are only half-into doing your research on the net… which may take as much time as coming up with ideas on your own -.- … anyone is able as of now… to track the source of your information or opinions to their original authors. There were however a couple of years when many students were … living off of reports they found on the net 😉

    Besides…
    Regarding PageRank…
    The reason why I found this entry this time.. actually.

    While the algorithm might be good, or at least the basic idea better than not having anything at all… its implementation is… questionable.

    A certain site… ( well… take a guess whose site it is ) had its pagerank decided as 0 out of 10 even before its launch. For its domain was publicised as a static link by a whois harvesting company the same day it was registered… in a context, and at a time when the actual pages were not yet present of course. A certain bot crawled the not so legal information gatherer, and as the only reference / referrer to the actual pages, made it look real bad, thus it was not just indexed as undecided pagerank but as 0 out of 10… in which state it still is. No matter how many other entries on the net refer to it. For it’s quite popular among those who even find it. And on the net the option to go after someone for making you look bad isn’t worth anything, the indexing and the damage have been done, and nothing will make it vanish it seems ( speaking after two and a half months of experience ).

    While offline publications sooner or later wander into the corners of libraries, verandas and are forgotten at the summer houses… academy indexes are refreshed by actual people… libraries are overhauled based on demand… algorithms just can’t handle such situations as they’re not programmed to judge pages by their… good will.

    If you know who should be contacted to at least reset the chance-o-meter please tell me, I’m bust.

    I can imagine citations being either accidentally or deliberately too early for someone. Without them knowing it.

  26. Wow. That post was textbook salesmanship 101. I would love to work at Google.

    Did you study marketing or sales at school Matt? I am thinking of majoring in it, and want to know what courses you recommend!

    Great post, as always Matt! Nearly as beautiful as “Emmy”.

  27. great info

  28. Lol! I wonder what the cost per view of that would be.
    But I think would be a great idea because only quality articles would get that higher rank.
    Great Post!

  29. Oh god, what will be next? Chips in their brains? I cannot believe it; something has to stay the same, and that is the school system. Great subject Matt, I have to admit that I thought that people don’t care about it but I was wrong.

  30. nice post! new here!

  31. Ranking papers by using a PageRank-like algorithm could be a worthwhile idea to test. I wonder if this will become more popular or is it just a short term fashion?
