I’m presently doing some experiments with Nutch, the Open Source search engine that has recently been moved to the Apache Incubator.
I’ve been reading that Nutch’s ranking algorithm is supposed to be better because it is open, but I couldn’t find (either googling around or nutching around) a complete description of how it ranks pages. Does it take inbound links into consideration, as Google’s does, or not? I’d really like to know.
You can choose whether it ranks by links or not.
I have the same puzzle.
I even took a close look at Nutch’s code.
indexer/indexSegment.java
gives code like:
calculateBoost ( ) {
  // 1. Start with page's score from DB (1.0 if no link analysis).
  // Note: this initial score comes from the fetchlist; the page rank itself is from the webdb structure.
  float res = pageScore;
  // 2. Apply scorePower to this.
  // Note: the result then depends on the current segment, with more rank added if links are counted.
  res = (float)Math.pow(pageScore, scorePower);
  // 3. Optionally boost by log of incoming anchor count.
  if (boostByLinkCount)
    res *= (float)Math.log(Math.E + linkCount);
  return res;
}
So, as far as I can tell, it only counts the inbound links and doesn’t take the page rank of the linking pages into account. That is why I am confused.
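To make that reading concrete, here is a minimal standalone sketch of the three steps quoted above. It is not the Nutch source: the class name BoostSketch, the parameter list, and the sample value 0.5 for scorePower are assumptions made only for illustration.

```java
// Hypothetical standalone sketch of the quoted steps; not the actual Nutch class.
class BoostSketch {

    // pageScore: score stored for the page (1.0 if no link analysis was run).
    // scorePower: exponent applied to the page score (assumed parameter).
    // boostByLinkCount / linkCount: optional boost by the number of inbound anchors.
    static float calculateBoost(float pageScore, float scorePower,
                                boolean boostByLinkCount, int linkCount) {
        // Steps 1 and 2: start from the page score and raise it to scorePower.
        float res = (float) Math.pow(pageScore, scorePower);
        // Step 3: optionally multiply by log(e + linkCount).
        if (boostByLinkCount) {
            res *= (float) Math.log(Math.E + linkCount);
        }
        return res;
    }

    public static void main(String[] args) {
        // Without link analysis every page starts at 1.0, so only the
        // number of inbound links changes the result.
        int[] linkCounts = {0, 10, 100};
        for (int n : linkCounts) {
            System.out.println(n + " inlinks -> boost "
                    + calculateBoost(1.0f, 0.5f, true, n));
        }
    }
}
```

In this sketch two pages with the same number of inbound links always get the same boost, whatever the scores of the pages linking to them, which is exactly the point that puzzles me.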
Please let me know if you see it differently.
My email is fji_00@yahoo.com
Michael
If you look at the latest trunk code, you’ll see that things are completely different now. Rather than boosting by the link count, it boosts by the square root of the sum of the link scores, and each link score reflects the score of the inlink’s source page. So it does in fact do what you’re looking for.
Check out the code from svn to see for yourself. Relevant classes are CrawlDb and Indexer.
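As a rough illustration of the difference, the sketch below computes a boost as the square root of the sum of inlink scores, as described above. It is not taken from CrawlDb or Indexer: the class and method names are hypothetical, and how each link score is derived from its source page is left out.

```java
// Hypothetical sketch of "square root of the sum of the link scores";
// not the actual trunk code.
class InlinkBoostSketch {

    // inlinkScores: one score per inbound link, each assumed to reflect
    // the score of the page the link comes from.
    static float boostFromInlinks(float[] inlinkScores) {
        double sum = 0.0;
        for (float s : inlinkScores) {
            sum += s;
        }
        return (float) Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // One link from a high-scoring page versus four links from low-scoring pages.
        float[] oneStrongLink = {9.0f};
        float[] fourWeakLinks = {0.25f, 0.25f, 0.25f, 0.25f};
        System.out.println("one strong link: " + boostFromInlinks(oneStrongLink));
        System.out.println("four weak links: " + boostFromInlinks(fourWeakLinks));
    }
}
```

Under these assumptions a single link from a high-scoring page can outweigh several links from low-scoring ones, which a pure link count cannot express.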