In the late 1990s, search engines typically used text-based methods to decide which results to return for a query, but these methods lacked a reliable way to judge the importance or quality of web sites. As a result, average-quality websites appeared prominently in search results, even though web users typically visit only much higher-importance sites. It is difficult to assign quality or importance ranks to web sites through textual analysis alone, because doing so requires human-level insight into semantics.
The authors of “The PageRank Citation Ranking” present a novel method for automatically assigning importance rankings to all sites on the web, via a recursive system in which important sites lend importance to the sites they link to. In the PageRank method, each web site has an importance score. This score is divided by the site's number of out-links, and each site it links to receives the resulting quotient in importance. In this way, the total importance flowing into each site equals the importance flowing out. As an extra caveat, to handle the case of sites with no out-links becoming importance “sinks,” each site receives some importance for free, regardless of its in-links. This mechanism models the behavior of a random web surfer who follows out-links uniformly at random and occasionally jumps to a random web page.
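The splitting-and-free-importance scheme described above can be sketched as a short power-iteration loop (a toy illustration; the three-page graph, the 0.85 follow probability, and the iteration count are my own assumptions, not details from the paper):

```python
# Toy web graph: each page maps to the pages it links out to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def pagerank(graph, damping=0.85, iters=100):
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}      # start uniform
    for _ in range(iters):
        # Every page gets some importance "for free" (the random jump).
        new = {page: (1 - damping) / n for page in graph}
        for page, out_links in graph.items():
            share = damping * ranks[page] / len(out_links)  # score / #out-links
            for target in out_links:
                new[target] += share
        ranks = new
    return ranks

ranks = pagerank(graph)
```

Here site C, which is linked to by both A and B, ends up with the highest score, while B, which receives only half of A's outgoing importance, ends up lowest.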
The main contribution of the paper is a new algorithm for automatically assigning an importance score to each node of a graph, based on its links to other nodes. The paper also presents an efficient way of implementing the algorithm (computing the scores), through a simple iterative algorithm that is similar to gradient descent or value iteration. The algorithm converges in just a few dozen cycles for the Web. The authors note that PageRank can be applied to domains other than web page importance, such as academic citation networks. One benefit of PageRank over more naive metrics for page quality, such as counting in-links, is that it is robust to attacks, such as creating low-quality pages that link to one's own page. This attack would not raise one's own page's PageRank very much, because the pages that link to it would have low PageRank.
It is hard to find fault with this paper, as it presents a novel approach to search result ranking, along with an efficient algorithm and an especially practical implementation (Google Search).
This paper deals with the implementation of PageRank, a method for rating Web pages using the link structure of the web. It was developed by Larry Page (hence the name) and Sergey Brin, the founders of Google Inc., and was motivated by the existence of millions of unorganized, heterogeneous web pages on the World Wide Web. It helps in gauging which web pages are important and relevant to the query made on the search engine.
A page has two types of links: forward links (outedges) and backlinks (inedges). If a page has a lot of backlinks, it is considered important. A simple ranking of a web page depends on the number of backlinks and forward links and a normalization factor that ensures the total rank of all web pages is constant. But this model has a problem: during iteration, a loop of pages can accumulate rank without ever distributing it to pages outside the loop, creating a rank sink. A random surfer model is also presented in the paper, which corresponds to the standing probability distribution of a random walk on the graph of the web. Sometimes a page has a dangling link, which is a link that points to a page with no outgoing links. The implementation of PageRank requires a web crawler to build an index of links as it crawls. Personalized PageRank is also possible, helping corporations improve the searches on their own search engines so that relevant items are found swiftly.
The paper is successful in explaining the basics and implementation of PageRank. It also has some content on how searching is conducted using PageRank, which extends the topic well. The applications section sums up the usefulness of PageRank in a detailed way.
The paper doesn’t give sufficient examples of the formulas introduced for the calculation of the rank. Sometimes the PageRank of a webpage is low but its usage is high; this anomaly is not addressed effectively.
What is the problem addressed?
The importance of a Web page is an inherently subjective matter, which depends on the reader's interests, knowledge, and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating pages objectively and mechanically, effectively measuring the human interest and attention devoted to them.
Though there is a large literature on academic citation analysis, there are a number of significant differences between web pages and academic publications. Unlike academic papers, which are scrupulously reviewed, the environment of the web is rather ad hoc, containing competing profit-seeking ventures and attention-getting strategies. There are many axes along which web pages may be differentiated. In this paper, the authors deal primarily with one: an approximation of the overall relative importance of web pages via a random walk.
1-2 main technical contributions? Describe.
They gave an elegant formulation of PageRank, aligned with the stationary state of a random walk on the whole web. Consider the whole Web as a directed graph: each web page is a node, and there is a directed edge from web page 1 to web page 2 if there is a link from page 1 to page 2. They assume the surfer performs a random walk on the web, moving uniformly at random among a page's neighbors, and define PageRank as proportional to the stationary distribution of this Markov chain. That is, the PageRank of a page is the probability that the surfer is at that page once the random walk converges.
However, not every Markov chain on a directed graph has a stationary distribution, or a non-degenerate one (e.g., the existence of sink nodes makes some nodes have zero probability). They introduce the source of rank, E, to resolve these problems, modeling a surfer who occasionally "jumps" to random pages independently of the graph's links.
To apply PageRank to title search, the search engine finds all web pages whose titles contain all of the query words, then sorts the results by PageRank.
1-2 weaknesses or open questions? Describe and discuss.
The paper considers only a few ways to define the source of rank, E, while claiming that it is important. In fact, a more natural way to model surfer behavior is to use the transient state of the Markov chain instead of the stationary state. We could model the source of rank (determined, for example, by the user's bookmarks) as the initial state and consider the transition matrix based on the previous history.
This paper presents a new method called PageRank to assign relative importance to web pages. In short, PageRank gives a global ranking of all web pages based on their location in the Web’s graph structure, regardless of their contents. To be more specific, PageRank is a calculation based on the graph structure, and the paper also presents an algorithm to compute it. PageRank provides search results of better quality to all web users. The paper first gives the mathematical description of PageRank, along with implementation concerns. It then provides several experiments, including a web search engine called Google built on PageRank, and the results show that PageRank is efficient to compute and provides good quality. Finally, it mentions other applications PageRank can be applied to.
The problem here is that there are a large number of web pages that diverge widely from one another, which makes it hard to design a search engine ranking function. Web pages vary on a much wider scale than academic papers in quality, usage, citations, and length. On the other hand, web pages are hypertext and provide auxiliary information such as link structure and link text. Thus, a better method to estimate the overall relative importance of web pages is needed.
The major contribution of the paper is that it provides a good method, PageRank, to calculate the relative importance of a web page, and most importantly, the calculation is based only on the Web graph structure, not on web page contents. The paper also provides detailed experiments on PageRank to show its efficiency and implementation. One weakness concerns the special case of dangling links, which are simply links that point to pages with no outgoing links. The paper just mentions that dangling links are excluded from the PageRank calculation, but it would be better to explain why, and whether those links should eventually be considered.
One interesting observation: this paper is in general good at presenting an innovative method, PageRank, to calculate the relative importance of web pages. Also, the experiments show that the method is efficient and provides a better user experience. However, it would be better to provide a method to measure user experience, such as some way to measure what users want and what they actually get after searching. A more solid measurement could show that it actually provides a better user experience.
WHAT PROBLEM DOES PAGE RANK SOLVE?
The World Wide Web creates many new challenges for information retrieval. It is very large and heterogeneous. Hence, it becomes difficult to surface the most relevant pages at the top of a search engine's result page. PageRank is a method for rating the importance of web pages by using the link structure of the web. It finds application in estimating web traffic, backlink prediction, user navigation, and many other information retrieval tasks.
Link structure of Web Pages: Every page has some number of forward links (out edges) and backlinks (in edges). Forward links are known at the time the network graph is downloaded, but backlinks cannot be fully known. The PageRank algorithm states that a page is important if important pages link to it. It assigns a numerical weight to each page to represent the page's importance, forming a probability distribution over web pages so that the sum of all PageRanks is one. The PageRank value for a page u depends on the PageRank values of each page v in the set of all pages linking to u, divided by the number of links from page v. The PageRank theory states that an imaginary surfer who randomly clicks on links will eventually stop clicking. A vector of pages that the surfer jumps to is also added to the equation. At each step, the probability that the person continues clicking is the damping factor. The damping factor is subtracted from 1, and this term is then added to the product of the damping factor and the sum of the incoming PageRank scores.
This score is recalculated each time the search engine crawls the Web and rebuilds its index.
If a page has no links to other pages, it becomes a sink and therefore terminates the random surfing process. If the random surfer arrives at a sink page, it picks another URL at random and continues surfing. The PageRank values are the entries of the dominant eigenvector of the modified adjacency matrix, which makes PageRank a particularly elegant metric.
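That eigenvector view can be checked directly on a tiny example (my own three-page graph, not from the paper): build the column-stochastic transition matrix, replace the sink page's column with a uniform jump, and power-iterate; the resulting vector satisfies M·r = r, i.e. it is the eigenvector for eigenvalue 1:

```python
# Page 2 is a sink (no out-links); its column becomes a uniform jump
# so that every column of M still sums to 1 (column-stochastic).
n = 3
out_links = {0: [1, 2], 1: [2], 2: []}

# M[i][j] = probability of moving from page j to page i.
M = [[0.0] * n for _ in range(n)]
for j, outs in out_links.items():
    targets = outs if outs else list(range(n))   # sink: pick any URL at random
    for i in targets:
        M[i][j] = 1.0 / len(targets)

r = [1.0 / n] * n
for _ in range(200):                              # power iteration
    r = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]

# One more application of M leaves r unchanged: M r = r.
Mr = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
```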
The algorithm's convergence relies on the assumption that the Web is an expander-like graph. A random walk is rapidly mixing if it quickly converges to a limiting distribution on the set of nodes in the graph, and a graph has a good expansion factor if and only if its largest eigenvalue is sufficiently larger than its second-largest eigenvalue.
Issues of Dangling Links:
Links that point to pages with no outgoing links affect the model, since it is not clear where their weight should be distributed. A solution is to remove them before the PageRank calculation and add them back afterwards. The implementation sorts the link structure by ID, makes an initial assignment of ranks, and starts iterating.
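The remove-then-restore treatment of dangling links can be sketched as follows (the graph and the choice of three restoration iterations are illustrative assumptions of mine, not the paper's exact procedure):

```python
def pagerank(graph, damping=0.85, iters=100):
    """Standard power iteration over a graph with no dangling pages."""
    n = len(graph)
    r = {p: 1.0 / n for p in graph}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in graph}
        for p, outs in graph.items():
            for q in outs:
                new[q] += damping * r[p] / len(outs)
        r = new
    return r

full = {"A": ["B", "D"], "B": ["A"], "D": []}        # D is a dangling page
dangling = {p for p, outs in full.items() if not outs}

# Step 1: drop dangling pages and all links into them, then iterate.
core = {p: [q for q in outs if q not in dangling]
        for p, outs in full.items() if p not in dangling}
ranks = pagerank(core)

# Step 2: add the dangling links back for a few final iterations.
ranks.update({p: 0.0 for p in dangling})
for _ in range(3):
    new = {p: (1 - 0.85) / len(full) for p in full}
    for p, outs in full.items():
        for q in outs:
            new[q] += 0.85 * ranks[p] / len(outs)
    ranks = new
```

The core computation is unaffected by the dangling page, and the final iterations give D a rank without letting it distort the rest of the graph.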
The authors also ran experiments showing that the distribution of web pages the random surfer periodically jumps to is an important component of the PageRank algorithm. Making this distribution uniform over all web pages gives high rank to many broadly popular links, while a personalized PageRank can concentrate the distribution on a single page.
New pages may be assigned low PageRank and take a long time to get listed and gain high rank. Search results are based on literal text rather than meaning. Using this algorithm, it might also be easy to manipulate search results if an organization invests time in increasing the ranking of its website.
This paper discusses the now-famous PageRank, one of the primary contributors to Google's early commercial success. The algorithm is designed to rank pages by modeling a "random surfer". Since the importance of a page is largely subjective, there is no way to accurately judge, or concretely measure, how important a page is. The best approach is then to use links to a page as recommendations, or "citations" in the terminology of scientific papers. However, unlike academic papers, which are peer reviewed and generally high quality and trustworthy, the internet is very different. There are many pages that are deliberately misleading, and link farms designed to increase the weight of certain pages. Simple "in-link" counting is therefore not robust enough to accurately measure page value; not all links are created equal.
The basic idea of PageRank is that each site has a certain "rank" or "score". Each round, this score is evenly divided by the number of outlinks, and the score of each page it links to is increased by that value. There is then the issue of "rank sinks". Consider, for example, if yahoo.com (a site with a very high PageRank) links to site X, which links to site Y. Site Y links back to site X, and those are the only links those two sites have. They will continue to increase each other's scores each round, but will never "give off" any of their rank. To eliminate this problem, Larry and Sergey added random jumps into the algorithm; this essentially models a user becoming 'bored' and jumping to a new page entirely, instead of following links. Rank that would otherwise pool in these sinks is thereby redistributed across all pages.
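The yahoo.com → X ↔ Y example can be simulated directly: with pure link-following (no random jumps) every bit of rank ends up trapped in the X–Y loop, while a jump probability returns rank to the rest of the web. (The three-page graph and the 0.85 follow probability are illustrative choices of mine.)

```python
def ranks_after(graph, follow_prob, iters=200):
    n = len(graph)
    r = {p: 1.0 / n for p in graph}
    for _ in range(iters):
        # With probability (1 - follow_prob) the surfer jumps to a random page.
        new = {p: (1 - follow_prob) / n for p in graph}
        for p, outs in graph.items():
            for q in outs:
                new[q] += follow_prob * r[p] / len(outs)
        r = new
    return r

graph = {"yahoo.com": ["X"], "X": ["Y"], "Y": ["X"]}
trapped = ranks_after(graph, follow_prob=1.0)   # no jumps: the loop keeps it all
escaped = ranks_after(graph, follow_prob=0.85)  # jumps leak rank back out
```

With no jumps, yahoo.com's score drains to zero while X and Y trade the entire rank mass back and forth; with jumps, yahoo.com retains the (1 - 0.85)/3 = 0.05 it receives from random arrivals.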
There are a few things in this paper that are very interesting. For one, the idea of 'scale' in the late 90s was very different from ours now: their system was ranking 150 million web pages. In 2008, Google indexed 1 trillion web sites, and it now indexes 30 trillion. It is also interesting that the founders of Google were using their system to find Wolverine Access 15+ years ago, one thing I still use it for fairly frequently. It is also interesting that the algorithm can be used to predict web traffic: since it attempts to simulate a random web surfer, pages with high PageRanks should also be hit with more traffic.
There are some weaknesses to this paper. For example, the global ranking can warp search results: if I search for 'trees', and the page 'yahoo.com' contains the word 'trees', it will appear much higher in the results than, say, the academic pubs I'm looking for. It is also easy to see how link farms could be set up to dramatically increase the PageRank of a site. While that is slightly more complicated than in the case of straight-up link counting, it would still be trivial to warp search results. This is probably why Google's support for PageRank has been waning, and why it now keeps its search algorithm a secret (as well as using over 200 other factors in its search result ranking).
Problem and Solution:
The problem is that search engines on the Web face many challenges. The Web is large, heterogeneous, and diverse, and its users are at many different levels of sophistication. The solution is for search engines to rank the importance of pages according to their link structure. PageRank measures the importance of pages so that the best page can be shown first. Highly linked pages have higher importance, and pages linked to by an important page also gain importance; links are thus weighted in the PageRank calculation. PageRank is assigned initially and updated by looping through all the pages, and the iterations stop when the weights converge.
The main contribution is that it provides the algorithm and the logic of PageRank to measure the importance of pages. It is an improvement over simply ranking pages by the quantity of their links, because links are also weighted by the PageRank of their source. This prevents obscure pages from distorting the result and decreases the effect of fake links. The ranking of pages improves the quality of search results. PageRank also has uses beyond ranking pages for search: it can be used to estimate web traffic; it can predict backlinks, which makes it a good citation-count approximation; and the PageRank proxy can aid user navigation. It is also a good way to gauge whether a page is trustworthy, since a high-ranking page is more likely to be reputable. Another interesting point is that surfer behavior is modeled by adding a jump distribution to simulate surfers who periodically get bored.
One weakness is that it can only use data from crawlable pages. Though an uncrawlable page is unlikely to be important, the possibility still exists.
Another weakness is that the forward links of a page are only known after it is crawled and downloaded. Since the number of pages and links is very large, the system is not very responsive, and pages that grow quickly in a short time may be overlooked in the ranking.
This paper describes a metric called PageRank, which is used to score how “important” or “relevant” a web page is on the Web, based on both the quality and quantity of the pages that link to it. A metric that can score and rank web pages is of great importance in search engines and web analytics, because you need some metric to filter which sites to present to the user as the most important.
The idea behind PageRank is fairly simple. The PageRank of webpage u is defined by R(u) = c*sum(R(v)/Nv, v in Bu) + c*E(u), where Bu is the set of websites that point to u, Nv is the number of links that go out of site v, E(u) is a vector that corresponds to a source of rank, and c is a normalization factor. This value considers not only the number of links pointing to u (the summation), but also the quality of those links (by factoring in the rank of v). In addition, the source of rank E(u) resolves the problem of rank sinks, where a loop of links accumulates rank but never distributes it.
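A direct transcription of this formula into Python might look like the following (the small graph and uniform E are hypothetical examples of mine; c is recomputed each pass so the ranks keep summing to 1):

```python
# R(u) = c * ( sum over v in Bu of R(v)/Nv  +  E(u) )
backlinks = {"A": ["C"], "B": ["A"], "C": ["A", "B"]}   # Bu: pages pointing to u
out_degree = {"A": 2, "B": 1, "C": 1}                    # Nv: out-links of v
E = {u: 0.05 for u in backlinks}                         # uniform source of rank

ranks = {u: 1.0 / len(backlinks) for u in backlinks}
for _ in range(100):
    new = {u: sum(ranks[v] / out_degree[v] for v in backlinks[u]) + E[u]
           for u in ranks}
    c = 1.0 / sum(new.values())          # normalization factor
    ranks = {u: c * r for u, r in new.items()}
```

Because C is pointed to by both A and B while B receives only half of A's rank, C ends up ranked above B, exactly the quality-weighting the formula is meant to capture.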
PageRank does a great job preventing certain websites from unfairly gaming the metric. Because the ranks of the sources also matter, a website can’t create multiple fake links to boost its score, because those links will have low rank in the first place. The source of rank E(u) also has the interesting property of personalizing the ranking to a user’s search patterns, because it can highlight which sites are important to a specific user and generate a PageRank specific to that user.
The paper overall does a great job introducing the concept of PageRank. It was easy to understand the concept, and the results were presented nicely. PageRank is certainly an interesting metric, but it seems like it still has room for improvement. For example, when calculating the rank, it treats all links coming out of a website as equal. Although it might be more computationally intensive, certain links coming out of a website could be weighted as more important than others; for example, a link in a site's banner or front-page image is more important than a link in the terms-of-service section at the bottom.
This paper introduces a new method to rate Web pages, called PageRank.
PageRank approximates the importance of Web pages with this general idea: each page gets rank from all the pages that link to it, and its rank is spread evenly through all its out-links to other pages. To solve the problem of “rank sinks,” each page is assigned some tunable source of rank. In this way, PageRank becomes more flexible, with the potential to do customized search by tuning that source of rank. The method scales well, since the number of iterations grows on the order of log n.
One of the advantages of PageRank is that it is highly immune to manipulation. To make a page important, one has to convince some important page to link to it, which is often costly. This property keeps the search engine robust even against attempts by commercial interests to manipulate search results.
Some other significant advantages are:
1. The rank of a page reflects its relative importance, and it can make search faster and less expensive, since it is possible to build indexes for the high-rank pages.
2. PageRank provides a good tool for search, with high potential to be further improved and applied in many ways. For example, the paper illustrates its use in web traffic estimation and backlink prediction.
PageRank still suffers from some drawbacks:
1. New pages have low rank. The rank of a page is decided by how it is linked from other pages. As a result, a newly created page, which might be important, is not assigned a high rank at first.
2. Though PageRank deals with the rank sink problem, if there exists a loop of pages that is entered by some outside link and also links out to some outside page, the rank of the pages in the loop will be greater than if those pages did not form a loop. In this way, the rank of a page is still at risk of manipulation.
Determining the importance of a web page is important for many applications, such as web search engines. It is also a difficult problem; because search engines have so much influence on web traffic, web sites may try to use the format of their site to take advantage of the search engine’s algorithm and gain more traffic. This means that simple metrics such as number of relevant words or number of outlinks are not reliable for determining importance, because they are easily manipulated.
PageRank works by defining a page’s rank recursively, based on its backlinks (the pages that link to it). To calculate a page P’s rank, the rank of each of P’s backlinks B is divided by the number of outlinks that B has, then added to P’s rank. Thus, each page “contributes” an equal portion of its rank to each page that it links to. Because this definition is recursive, each page in the entire graph of the web must first be initialized to some value, and then the recursive definition must be applied iteratively until the values of the nodes converge. The “contribution” part of the definition models the probability that a user on some page will visit any given link on that page, and the final converged rank of page P models the probability that a user randomly visiting pages will end up on page P.
Some nodes don’t have any outlinks, and this means that the rank that “flows” into such a node never flows out. This is fixed by also forcing each node to contribute some portion of its rank to a random node in the set E. By default, E might be the set of all pages, which models the fact that sometimes users visit a random page that isn’t linked from the page they are currently viewing. E might also be some set of a user's most frequently used pages (or a home page), which would allow PageRank to personalize its rankings to that user’s preferences.
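This personalization can be demonstrated by making the jump distribution E a parameter of the iteration (a sketch over a toy graph of my own; the paper's E plays exactly this role):

```python
def pagerank(graph, E, follow_prob=0.85, iters=100):
    """Power iteration where random jumps land according to distribution E."""
    ranks = {p: 1.0 / len(graph) for p in graph}
    for _ in range(iters):
        new = {p: (1 - follow_prob) * E[p] for p in graph}
        for p, outs in graph.items():
            for q in outs:
                new[q] += follow_prob * ranks[p] / len(outs)
        ranks = new
    return ranks

graph = {"home": ["a"], "a": ["b"], "b": ["home"], "c": ["home"]}
uniform  = {p: 0.25 for p in graph}                       # default E: all pages
personal = {"home": 1.0, "a": 0.0, "b": 0.0, "c": 0.0}    # E: user's home page

default_ranks  = pagerank(graph, uniform)
personal_ranks = pagerank(graph, personal)
```

Concentrating E on "home" lifts its rank, and via its out-links the pages it endorses, relative to the uniform default.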
The paper clearly sets forth the problems of using certain metrics for assigning importance to pages, and shows why PageRank is less vulnerable to manipulation. It also gives intuitive reasons for why its rankings model actual importance.
Maybe I’m biased, but they mentioned Google and never really went into detail about how it uses PageRank. Although recall/precision might be hard to quantify, it would have been nice to see more general, quantifiable results for the strength of their search engine instead of just an anecdotal example.
This paper introduces Google’s famous web page ranking algorithm, PageRank. It is the algorithm that literally made today’s Google possible, transforming it from a simple web search engine provider into an Internet giant. Its business model spans mobile devices, operating systems, email, and even automated automobiles (the driverless car project). The practicality and importance of the algorithm need no justification or emphasis at this point, when it has already achieved enormous success over the past decade.
Measuring the importance or rank of a web page is a difficult problem. How can one figure out such a thing just by looking at it? Even if a web page has good-quality content, does that also mean it is genuinely authentic? In the case of books, readers can read a book and share reviews, or even simply judge the book by the author’s reputation. Unfortunately, this is not possible for web pages. Who could evaluate billions of web pages in a way that requires human intervention? Web search engines are responsible for returning search results that are most likely to be relevant to users, and we surely need a more systematic method to measure the rank of a web page.
PageRank is an algorithm that utilizes the link structure of the World Wide Web. It starts with a simple intuition: highly linked web pages are generally more important than pages with fewer links. This may sound just like an academic citation count, but it is possible that a few links from highly respected web pages are much more important than hundreds of links from junk web sites. Therefore, the algorithm uses the rank of the backlinks to calculate the rank of a web page. In short, “a page has high rank if the sum of the ranks of its backlinks is high.”
The formula used in the algorithm has some additional features to deal with special cases, such as rank sinks, where multiple web pages have links that loop among them without outgoing links to other web pages. Even with these additional features, the algorithm is very simple and intuitive, which I think is why it works so well in practice and has been successful for so many years. This simple algorithm captures the behavior of a “random surfer,” and it is run iteratively to calculate the ranks of web pages, since important web pages are likely to be referenced more than others. It is remarkable that it works so well in the real world.
The algorithm as presented in the paper has some shortcomings, though I suspect Google has already fixed most of these or is actively working on them. It is not impossible to exploit the algorithm to intentionally gain a high rank for a web page. One could maliciously inject invisible links, visible only to a web crawler, into highly respected web pages, directing them to one's own page for a high rank. An owner of a popular web site could intentionally create links to give a high rank to a page. The paper does not address what happens when high-ranked pages are abused in an attempt to exploit the algorithm.
In conclusion, PageRank is a simple, intuitive, and innovative algorithm that made Google the Internet giant of the current era. Even though the paper fails to address possible exploitation of the algorithm, we know Google has been working on such issues as it continues its dominant success in the web search and advertisement markets.
This paper tackles the challenge of creating an efficient summary of the importance and relevance of web pages on the Internet that is immune to manipulation. Unlike document collections, web pages contain more information than just their text. Furthermore, web pages do not go through a review process like academic papers, so there is no quality assurance or cost of publishing. In response, this paper proposes PageRank, a number representing the global importance of a web page.
The PageRank algorithm is iterative. Initially, there is a vector that represents a rank source and acts as an initialization variable. Then ranks are distributed through all the web pages based on their links to other pages. The rank of a page is determined by the sum of the rank arriving over the backlinks going into it, and the rank a page distributes over each link is equal to its own rank divided by its number of outgoing links. The paper identifies that there can be entities called rank sinks, sets of pages with no links out of the set, which accumulate rank without ever distributing it. To handle this, the paper attaches a decay factor to the algorithm.
The algorithm uses a random surfer model that treats all pages equally, but the paper remarks that the model could be improved to create personalized PageRanks. The authors created a search engine called Google that uses the PageRank system to retrieve results on the internet. The goal of personalized PageRanks could be to create a personalized search engine for users; such an engine would also be immune to manipulation by commercial entities. However, I feel that a personalized search engine may become a privacy concern. It seems the search engine would learn a user's habits and preferences, which could be abused if, for example, the engine also decided to show personalized ads based on a person's search history. I think the authors should consider the implications of creating such a system; maybe users should be allowed to decide if they really want personalized PageRanks.
This is the famous paper introducing the PageRank algorithm adopted by the Google search engine. The algorithm is developed based on the link structure of large hypertext systems such as the World Wide Web, but it can really be applied to any domain whose structure can be represented as a directed graph. This paper was published in 1998 and has 8601 citations to date.
Idea of PageRank:
The PageRank algorithm assumes that a webpage is more "important" if it is referenced by more webpages; also, if a webpage is pointed to by other "important" webpages, then it is more likely to be "important." Thus, the PageRank of a webpage is the weighted sum of the PageRanks of the pages that point to it. Based on this assumption, the weight should be higher if the pointing page points to only a few pages, and lower if it points to many pages, since its "importance" has been diluted. The idea can also be viewed as the importance of pages "propagating" through the graph along the forward links.
Calculation of PageRank:
To get the actual value of PageRank for each webpage, we can start with some initial values and perform several iterations until the values converge. Some variations of the calculation introduce randomness into the iteration, for example by adding a random vector that assigns some probability of a webpage jumping to another webpage at random. The paper also mentions that a source-of-rank vector E can be added to the iteration equation to make up for the rank sink problem. This additional vector is very useful in that it can also provide "personalized" PageRank.
The paper also implements PageRank to evaluate its performance. In the experiment on the rate of convergence, the number of iterations needed to reach a given threshold grows linearly with log(n), where n is the number of links. This is very desirable, since it means the algorithm scales very well.
There is no doubt that this paper makes a tremendous contribution to modern web search technology. However, its contribution is not limited to web search. Many other real-world problems can also be formulated as graph problems, such as social networks and many online businesses, e.g., Amazon and Netflix. PageRank provides a useful information-propagation technique that can be applied to these domains to retrieve interesting results; for example, many prediction and recommendation systems are based on PageRank-like algorithms.
Motivation for PageRank Citation Ranking
With a very diverse and large number of pages on the web, search engines face the challenge of not only sifting through an enormous number of pages, but also contending with inexperienced users and with pages that rig ranking functions, all while providing the information the user is looking for. Search engines use PageRank to create a global importance ranking of every web page and extract useful information out of the vast heterogeneity of the WWW.
Details about PageRank Citation Ranking
A page has high rank, and thus high importance, if the sum of the ranks of its backlinks (nodes linking to the node, or in-edges) is high. The formula for PageRank is R(u) = c * sum(R(v)/Nv) + c*E(u), where v ranges over the set of pages that point to u, Nv is the number of links from v, E is some vector over the web pages that corresponds to a source of rank, and c is a normalization factor. To implement PageRank, each URL was converted into a unique integer. The link structure was sorted by parent ID, and dangling links were removed. An initial assignment, which affects the rate of convergence, is chosen to improve performance. The weights for every page and the accesses to the link database, A, are kept on disk. PageRank computation ends in logarithmic time because the random walk rapidly mixes and the underlying graph has a good expansion factor. In the title-search example, title match, which ensures high precision, is merged with PageRank to ensure high quality. The common case for PageRank is the case where a word is common in many pages (e.g., 'Wolverine' is common in Michigan's administrative sites). In addition to the common case, PageRank takes collaborative trust into account, so that a page mentioned by a trustworthy site will have a high ranking. PageRank is protected against commercial manipulation because a page has to convince an important page, or many non-important pages, to link to it. Applications of PageRank include estimating web traffic, predicting backlinks, and user navigation.
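The bookkeeping described above (URLs mapped to unique integers, the link structure stored as a table sorted by parent ID) can be sketched as follows. The URLs, the tiny link table, and the damping value are made up for illustration.

```python
# Sketch of the implementation bookkeeping: URLs become integer IDs, and
# the link structure is a table of (parent, child) pairs sorted by parent.
# URLs, links, and the damping factor are assumed for illustration.

urls = ["https://a.example", "https://b.example", "https://c.example"]
url_to_id = {u: i for i, u in enumerate(urls)}     # each URL -> unique integer

links = sorted([
    (url_to_id["https://a.example"], url_to_id["https://b.example"]),
    (url_to_id["https://a.example"], url_to_id["https://c.example"]),
    (url_to_id["https://b.example"], url_to_id["https://c.example"]),
    (url_to_id["https://c.example"], url_to_id["https://a.example"]),
])                                                  # sorted by parent ID

out_degree = [0] * len(urls)
for parent, _child in links:
    out_degree[parent] += 1

# One update pass over the sorted link table.
n, c = len(urls), 0.85
rank = [1.0 / n] * n                                # initial assignment
new_rank = [(1 - c) / n] * n
for parent, child in links:
    new_rank[child] += c * rank[parent] / out_degree[parent]
```

Sorting by parent ID matters at scale: it turns the pass into a sequential scan of the on-disk link table rather than random accesses.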
Strengths of the paper
Concurrently with EECS584, I am taking EECS551 Signal Processing and Matrix Analysis, where we had just talked about PageRank and the information gained from the principal eigenvector. I enjoyed learning about PageRank in greater detail from this paper, and additionally learning how PageRank can account for common cases and collaborative trust.
Weaknesses of the paper
I would have liked to see the paper discuss situations where PageRank fails to bring up the desired result or is inadequate. Also, I would have liked to see the paper talk about the role that browsing history on personal computers could play in PageRank.
The subject of this paper is PageRank and its use in searching the web. PageRank was used to build Google's search engine back in the late 90s. The algorithm has evolved quite a bit since then, but the important properties are contained in this paper. The authors describe the basic algorithm and its mathematical formulation, and then discuss the problems that arise and how to modify the algorithm to resolve them. The weighted random walk sometimes creates a rank sink, where values pile up in the graph structure and aren't redistributed. This can be corrected by creating a rank source. After some defined convergence criteria are met, the algorithm stops, and the pages with the highest rank are returned by the search engine.
This paper makes a great contribution to information retrieval, as it defines a new algorithm to retrieve web pages (or other documents) in a corpus (possibly the Internet). Before this paper people were just looking at the content of pages and using vector space models to retrieve pages. This was a big step for search engines and makes things a lot more interesting. Future work on algorithms like this is what brings us the search engines of today. The graph structure provides additional properties to search engines such as making it more difficult for people to create pages that manipulate the ranking results.
Drawbacks of this paper are evident in a further examination of its ranking method. Search engine optimization can take advantage of this algorithm by creating many pages that link to the same page. Users can flood the web with these links, or inject links into higher-ranked pages, to take advantage of the algorithm. Additional analysis must be done to prevent people from ranking their pages this way; for instance, clustering domains and looking at links from other clusters to determine ranking. Furthermore, the algorithm can return sets of results that are related to each other, but not to what the user wants to find. If the random walk finds a set of pages that are all very similar to what it thinks the user is looking for but not what the user wants, the user will have to scroll through pages and might not find what they are looking for. This could be controlled for by using some sort of clustering over the nodes in the resulting PageRank graph.
Part 1: Overview
This paper presents a breakthrough algorithm to evaluate the importance of web pages. This algorithm, called PageRank, eventually led to the birth of the world's biggest search engine company, Google. Larry Page and Sergey Brin actually downloaded almost all the web pages and links on the internet; this took weeks and a carefully designed web crawler. The PageRank algorithm rests on a simple idea: pages earn high PageRank from valuable backlinks from other important pages with high PageRank.
Theoretically, PageRank converges to the eigenvector of the transition matrix corresponding to its largest eigenvalue, and that eigenvalue is simply one: by the Perron–Frobenius theorem, when every row sum is one, the largest eigenvalue is uniquely one. By iterating, the PageRanks of all pages converge to this steady state. To handle dangling pages on the web, the authors remove the dangling nodes first and then add them back; in their case this method has very little effect.
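The eigenvector claim can be checked numerically: for a strongly connected, aperiodic graph with no dangling nodes, the converged ranks form a fixed point of the transition step, i.e., an eigenvector with eigenvalue one. The three-page graph below is an assumed example.

```python
# Numerical check of the eigenvector view: after power iteration on a
# small strongly connected graph (an assumed example), one more transition
# step leaves the ranks unchanged, i.e., eigenvalue 1.

links = {0: [1, 2], 1: [2], 2: [0]}   # out-links of each page

def step(rank):
    """Apply the transition matrix once (each row sums to one)."""
    new = [0.0] * len(rank)
    for p, outs in links.items():
        for q in outs:
            new[q] += rank[p] / len(outs)
    return new

rank = [1.0 / 3] * 3
for _ in range(200):                  # power iteration
    rank = step(rank)

fixed = step(rank)                    # should equal rank (eigenvalue one)
```

For this graph the stationary distribution works out to (0.4, 0.2, 0.4), and one extra transition step leaves it unchanged, confirming the eigenvalue-one fixed point.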
Part 2: Contributions
Simple ideas often lead to great formulas that are simple, beautiful, and powerfully reveal the truth behind them. In this case, a page's high PageRank comes from backlinks from other important pages with high PageRank.
A great contribution to search engines! Thanks to PageRank we have Google today and experience such convenience on the web. Today one can ask Google anything and, not surprisingly, find that someone out there has already asked exactly the same question.
PageRank actually relies on the convergence of a Markov chain and the power iteration of a transition matrix whose rows sum to one.
Part 3: Drawbacks
Holders of high-PageRank pages can sell their pages and thereby profit, while the buyers can put advertisements on those high-PageRank pages.
Dangling nodes in real life may have an even greater population than regular nodes. For example, PDFs and images are all dangling nodes.
The damping factor is critical to finding PageRank and to the speed of convergence, yet this paper does not fully explore it. This is probably because the results were already good enough to found Google, and the authors preferred to leave it to future work.
The paper discusses a web page ranking technique called PageRank that helps search engines and users quickly make sense of the vast heterogeneity of the World Wide Web. The technique takes advantage of the graphical link structure of the Web to produce a global importance ranking of every web page.
PageRank makes it possible to order search results so that more important and central web pages are given preference over less important ones, which yields higher-quality search results for users. The insight behind PageRank is that it uses information which is not advocated by the web pages themselves: a page's backlinks, i.e., links pointing toward the page being ranked, rather than its forward links, i.e., links the page itself produces. In addition, not all backlinks are considered equal when calculating PageRank. Backlinks from important pages are more significant, and hence carry more weight, than backlinks from average pages. PageRank is then calculated as the weighted sum of a page's backlinks.
The paper’s main strength is that it tries to determine the ranking of pages based on human interest, which is the main target. In addition, the authors support their technique with extensive examples of areas of application, including “estimating web traffic,” “backlink prediction,” and “user navigation.” Furthermore, given the popularity of PageRank in high-end search engines, I found the authors’ contribution invaluable.
One of the main limitations of the paper is that the data used in the implementation is limited to millions of webpages. Today the number of web pages PageRank must consider is much larger, so the results presented don’t reflect current conditions. Furthermore, as the PageRank algorithm has been extensively applied in popular search engines including Google, it would be great if the techniques were reevaluated using current workloads, which are characterized by huge amounts of web data and high-performance distributed computing infrastructure. The other limitation is that the authors base their model on a random web surfer rather than a more realistic approach; it would have been better if they had come up with a model that resembles an actual web surfer.
In this legendary paper, PageRank is introduced: an audacious attempt to condense every page on the web into a single number. The paper was written in 1998, when Google was just a new experimental search engine. The purpose of PageRank is to provide a reasonable ordering of the web pages returned for a search query.
This paper shows how they approach the problem of ranking the importance of the web pages. They started from several intuitions:
1. pages that have more incoming links are more important
2. pages that have incoming links from important pages are important.
So they iteratively redistribute the rank of a page to its out-linked pages until the result converges to a stable state. They show several problems with this simple approach and improve the algorithm accordingly. For example, groups of pages with no outward links are called rank sinks: their ranks keep increasing because they never redistribute their ranks to the rest of the graph.
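The rank-sink problem can be demonstrated on a toy graph: here pages 1 and 2 link only to each other, so without a rank source they absorb all the rank, while a uniform rank source (an assumed 0.15 jump probability) keeps some rank on page 0.

```python
# Toy demonstration of a rank sink: pages 1 and 2 only link to each other.
# Without a rank source the loop soaks up all rank; with a uniform source
# (jump probability 0.15, an assumed value) page 0 keeps some rank.

def iterate(rank, links, jump):
    """One PageRank pass; `jump` is the total random-jump probability."""
    n = len(rank)
    new = [jump / n] * n
    for p, outs in links.items():
        for q in outs:
            new[q] += (1 - jump) * rank[p] / len(outs)
    return new

links = {0: [1], 1: [2], 2: [1]}      # pages 1 and 2 form a sink loop
no_source = [1.0 / 3] * 3
with_source = [1.0 / 3] * 3
for _ in range(100):
    no_source = iterate(no_source, links, 0.0)
    with_source = iterate(with_source, links, 0.15)
```

After iterating, `no_source[0]` has been drained to zero by the loop, while `with_source[0]` retains the jump share, which is the behavior the rank source E is meant to guarantee.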
The search engine also takes into account some clues from the users, such as bookmarks and front page, so that the search results are different but are reasonable to each user. Some other optimization considerations are also mentioned in the paper.
1. fast convergence to the stable result, which makes it scalable
2. combination of title search and PageRank increases efficiency and performance.
This paper uses a lot of graphs and tables to show the results of the algorithm. It explains the intuition and key ideas behind the algorithm at the beginning, which makes it easy to follow. As you may all have noticed, they use the Wolverine Access example, which suddenly caught my full attention! I like it!
Even though it mentions the scalability problem briefly, there is not enough information about how the growth in the number of webpages might challenge the algorithm. As the number of web pages grows, is it still possible to do a complete iteration? As far as I know, this paper doesn't provide enough discussion of these parts.
This paper describes the famous PageRank algorithm currently used by the Google search engine.
They begin with a simplified ranking function based on the number of links from and to a given page, along with a constant to normalize the rank across pages. However, this would not give an accurate result for two pages that point to each other but to no other page. This is where they introduce the E factor, which accounts for the random surfer who would break out of this kind of loop of pages one way or another.
They also take into account dangling links: links to pages that have no outgoing links of their own. The authors calculate the ranks of the pages after having removed the dangling links and then add them back. I think just adding them back after the ranks have been calculated might reduce the accuracy of the ranks. Also, regarding their example of a professor's website having a low rank because it has not been linked to or updated, the low rank could be put to good use by suggesting to the site's owner that the page seems outdated.
This is a very well written and researched paper, in the sense that many possibilities are considered and taken into account. One contribution I liked was the idea that if a site is mentioned on a highly ranked page, it must be highly trustworthy. One example that clearly shows the efficacy of this algorithm is the returning of "Wolverine Access" when searching for "wolverine," rather than the definition of the word.
They do mention the idea of personalizing searches based on the user; however, as we have observed in recent times, people do not get the same top results when searching for a given query, and this may not be a good thing in situations where absolute results are wanted.
I didn't realize until about the middle of this paper that it was the very paper describing Google, kinda hard to imagine it right now but that made reading the rest of the paper pretty interesting!
The PageRank Citation Ranking: Bringing Order to the Web
This paper is the theoretical foundation of Google as a modern web search engine. Its main topic is the PageRank algorithm and its practical application. To begin with, the authors explain that the information retrieval problem in the context of the World Wide Web is challenging, owing to both the large size and the heterogeneity of the data. They take advantage of the link structure of web pages to produce an approximation of the pages' overall relative importance. This ranking, called PageRank, can then be used in search engines to quickly sort the heterogeneous web pages.
The core idea of the PageRank algorithm is to use the forward and backward links to distribute importance among all the pages. Guided by the intuitions that a page pointed to by many important links is itself important, and that the importance of a page is distributed equally among all of its forward links, the PageRank algorithm applies the following update formula until convergence:
R(u) = c * SUM over v in Bu of ( R(v) / Nv ) + c * E(u), where Bu is the set of pages pointing to u and Nv is the number of forward links of v
From the update formula we can see that the evolution of the PageRank values is bounded by the decay factor c, and that the rank source E(u) nicely simulates random surfing.
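One way to read the formula is with c acting as a per-iteration normalization factor. The sketch below transcribes the update on an assumed three-page graph, normalizing each pass so the ranks remain a probability distribution; the graph, the uniform E, and the iteration count are illustrative choices.

```python
# The update formula transcribed directly, treating c as the normalization
# factor. B[u] holds the backlinks of u, N[v] the out-degree of v, and E is
# a uniform rank source; the three-page graph is an assumed example.

B = {0: [2], 1: [0], 2: [0, 1]}       # B[u]: pages v that point to u
N = {0: 2, 1: 1, 2: 1}                # N[v]: number of forward links of v
E = {u: 1.0 / 3 for u in B}           # uniform rank source

R = {u: 1.0 / 3 for u in B}
for _ in range(100):
    raw = {u: sum(R[v] / N[v] for v in B[u]) + E[u] for u in B}
    c = 1.0 / sum(raw.values())       # normalization: keep total rank at 1
    R = {u: c * r for u, r in raw.items()}
```

With the graph above, page 2 (two backlinks, one of them from a well-ranked page) ends up ahead of page 1.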
After dealing with dangling links, the algorithm converges quickly on commodity machines, and the search results presented by the corresponding search engine, Google, demonstrate that PageRank yields a more sensible ordering of search results.
To summarize, PageRank is a global ranking of all web pages; by applying it to sort search results, more important and central web pages can be surfaced to users. Besides building search engines, PageRank can also be exploited in other applications like traffic estimation and user navigation.
Nevertheless, owing to the limitations and lower requirements of its time, the paper has the following drawbacks when judged by today's practical needs:
1. Freshness: PageRank assumes that the ranks of all crawled web pages are computed together. In today's view, however, web pages are generated very quickly or even dynamically; there is no way to crawl all the content and run the algorithm to convergence in a short time, so presenting fresh results to the user could take days or weeks. Hence, the algorithm needs to be improved to support fast detection and addition of new web pages into the system.
2. Scalability considerations needed: As data grows explosively, processing web content at this scale needs a parallelized version of the PageRank algorithm, and it is essential for the user to understand how the algorithm would scale in terms of data size and number of CPUs. It would be better if the authors had discussed this direction and performed a few tests to further convince the readers.
3. The evolving (mobile) web: In today's world, almost everything can be found on mobile devices, whether as mobile websites or mobile apps. Is this algorithm still applicable, especially to app content? That is a question readers need to consider. Because app content nowadays appears more in a tree-like structure, with only a few nodes connected to the outside world, how to adjust the rank source E(u) for them is worth thinking about and researching.
As the number of web pages increases dramatically, there is a need to rank webpages efficiently. In this paper, the authors introduce the PageRank algorithm, a method for rating webpages that helps search engines and users make sense of how important each page is.
The topological structure is that each page has incoming links and outgoing links. PageRank considers how many links go into a page and also the rank of the source pages those links come from. But there is a problem with the simplified ranking function: a rank sink is a loop that accumulates rank during each iteration but never distributes rank to other pages. The concept of a rank source is then introduced to deal with this problem: the surfer jumps to a random node based on E(u).
Dangling links, which are links that point to pages with no outgoing links, can simply be removed before the PageRank calculation and added back afterwards.
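That remove-then-restore treatment might look like the following sketch; the three-page graph, with page 2 dangling, and the core ranks are assumed for illustration.

```python
# Sketch of the dangling-link treatment: drop pages with no out-links,
# compute ranks on the remaining core, then give the dangling page rank
# from its parents afterwards. Graph and core ranks are assumed.

links = {0: [1], 1: [0, 2], 2: []}    # page 2 is dangling (no out-links)

# Remove dangling pages (and links to them) before iterating.
core = {p: [q for q in outs if links[q]]
        for p, outs in links.items() if links[p]}
assert core == {0: [1], 1: [0]}       # only the 0 <-> 1 loop remains

rank = {0: 0.5, 1: 0.5}               # converged ranks of the symmetric core

# Add the dangling page back: one pass of rank from its parents,
# using the original out-degrees.
rank[2] = sum(rank[p] / len(links[p]) for p in links if 2 in links[p])
```

Here page 2 receives half of page 1's rank (page 1 has two forward links in the original graph), without having been part of the iteration.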
The paper clearly introduces the algorithm used by Google's search engine, namely how to calculate the "importance" of each webpage. It is very similar to the concept of a Markov chain: one can regard users as randomly clicking links and compute the values at convergence. A page's rank is the probability assigned to it by that converged distribution.
As the number of pages is extremely large, the calculation of PageRank should be very expensive. The paper doesn't point out how to deal with this problem. There should be some way to divide the calculation, but the difficulty is that, since the graph is tightly linked, it is unclear how to divide the work.
In this paper, the authors attempt to solve the problem of discovering the importance of individual webpages. The primary application they are trying to use this for is web search. The authors develop a model where they propagate rank from pages to the pages that they point to. In other words, the rank of a page depends on the rank of pages that point to it. The model, though simple, turns out to be quite powerful as it has a number of applications and is robust against certain kinds of manipulation.
The authors find that the PageRank computation converges quickly, making it practical for real applications. The robustness of this algorithm against fraud also makes it useful for web search, the primary use case.
This paper provided few numerical measurements and most of the claims made do not have any quantitative evaluation. An example of this is in section 2.7 where the authors claim that the removal and re-insertion of dangling links should not have a large effect on existing PageRank scores.
This paper introduces the PageRank algorithm and how it is used as Google's main algorithm for ranking page searches. With many other ranking algorithms, websites could use simple programs to fool the algorithm into thinking the website is very important. PageRank uses metrics that are harder to cheat, so that more important web pages land at the top of the search. This algorithm has revolutionized how search engines rank web pages to this day.
Page Rank works by looking at the forward links and backlinks of a page. We are going to define an important page as a page that has many links from important pages to it. Using this, we can propagate page ranks across the internet. Note that in order to keep page ranks even throughout the internet, we are going to split up the scoring among the different pages that this page links to. For example, if this current page has 200 points and it has 4 forward links, every page it links to will be given 50 points. Furthermore, to prevent rank sinks from occurring, we will introduce web pages that are rank sources and a vector that corresponds to these rank sources. We will then run this algorithm over the entire internet to get all of the page ranks.
While doing so, we may encounter some problems, such as dangling links and just how big the internet is. Dangling links can simply be removed from the graph because they do not significantly affect the ranks of other pages. To be able to run this algorithm in a reasonable amount of time, we need to choose good initial values and then iterate until the values converge.
In general, this paper gives a very comprehensive overview of Page Rank and how it is used in Google’s search engine. However, I do think there are some weaknesses with the paper:
1. How do the web pages start off with points? If Page Rank uses only links from other pages to determine the rank of the current page, no pages will have any points if all pages start off with 0 points.
2. The paper does explain the differences between Page Rank and previous algorithms that are used for ranking, but it does not go through a practical example. I would have liked to see some search executed with both algorithms and the results compared.
This paper discusses the development of the PageRank algorithm. PageRank uses the number of "backlinks" to a page from other pages on the web in order to assign it a single-number rank that expresses its importance relative to other web pages. A page's rank is a function of the ranks of the pages that link to it. The algorithm is run iteratively until the page ranks converge.
One of the most critical features of the PageRank algorithm is that it is difficult to manipulate. Older ranking algorithms that only looked at the raw text of a page or counted the number of links could easily be manipulated by putting lots of hidden text in the page or by creating many garbage pages filled with links to your page. PageRank, on the other hand, weights the importance of a backlink by the rank of the page that link is hosted on. Therefore, creating many throw-away pages with links to your page will only add a small amount of weight to your page's rank, as the throw-away pages will have a very low rank. Conversely, a page with only a few inbound links can have a fairly high PageRank if those links come from pages with high PageRanks. This allows for a "collaborative notion of authority," in which people come to trust and give importance to pages linked to by highly ranked pages.
The other main component of the PageRank algorithm is the E value, a vector over the web pages that is used as a source of rank. This value can be used to adjust and personalize page ranks. E can be set entirely to a single page, in which case, children of that page and other closely related pages will have a boosted page rank, or it can be distributed among many sites, such as bookmarked sites, home pages, etc. One of the purposes of the E value is to counteract “rank sinks”, sets of pages that link to each other, but have no outbound links. Even one backlink to such a cycle of pages could eventually cause these pages to accumulate infinite weight if the E value were not factored into the calculation.
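The personalization idea can be illustrated by comparing a uniform E against an E concentrated entirely on one page; the three-page cycle, the page names, and the 0.85 damping value below are assumed for demonstration.

```python
# Comparing a uniform rank source E against one concentrated on "home":
# concentrating E boosts "home" and, to a lesser degree, its neighborhood.
# Graph, names, and damping factor are illustrative assumptions.

def pagerank(links, e, c=0.85, iters=200):
    rank = dict(e)                     # start from the source distribution
    for _ in range(iters):
        rank = {u: c * sum(rank[v] / len(links[v])
                           for v in links if u in links[v]) + (1 - c) * e[u]
                for u in links}
    return rank

links = {"home": ["a"], "a": ["b"], "b": ["home"]}
uniform = pagerank(links, {p: 1.0 / 3 for p in links})
personal = pagerank(links, {"home": 1.0, "a": 0.0, "b": 0.0})
```

With the uniform E, every page in the symmetric cycle converges to 1/3; with E concentrated on "home," its rank rises above 1/3, with the boost decaying around the cycle.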
One of the primary weaknesses of this paper is the lack of formal testing and evidence. The developers had only a partial map of the web, and a primitive version of Google to test their algorithm, so they could not make many direct comparisons to current state-of-the-art search engines. Also, their initial search engine prototype only matched pages whose titles contained the full text of the query. While this did a remarkable job, I would have liked to see more discussion of how PageRank could be combined with more advanced search criteria than a simple title match.
This paper discusses a method called PageRank, which produces a global ranking of every web page. The global web creates many challenges for information retrieval; for example, web pages are numerous and heterogeneous. Therefore, this paper proposes PageRank, a method for computing a ranking for every web page based on the graph of the web.
First, the paper introduced the link structure of the web. The current graph of the web has roughly 150 million nodes and 1.7 billion edges, which are very large numbers. After introducing the structure of the web, the paper gave the definition of PageRank. For the implementation, the authors converted each URL into an integer, and stored each hyperlink in a database using the integer IDs to identify pages, and then compute the PageRank. The paper showed that PageRank on a large 322 million link database converges to a reasonable tolerance in roughly 52 iterations.
Second, the paper discusses the applications of PageRank, including estimating web traffic, predicting backlinks, and user navigation. For academic use, PageRank can be used to predict citation counts: since it is difficult to map citations completely on the web, PageRank can estimate a page's citation importance even more closely.
The strength of this paper is that it covers the PageRank topic in detail, including motivation, challenges, implementation, and applications. Since the web may be a new area for readers, this paper provides a good description of a database application on the web, which is a popular area now.
The weakness of the paper is that it provides few examples throughout the article. Since the web is close to everyday life, I think it would be easier and more interesting for readers if the paper linked the theory with more real-life examples.
To sum up, this paper talked about PageRank, which is a method of producing a global ranking of every web page.
This paper is an introduction to PageRank, an algorithm for search engines to rank the relative importance of webpages. This algorithm was the initial core of Google's search engine and was a major part of how Google initially started ranking pages. It is important to note, however, that the ranking is computed completely query-independently: a page's rank will not change for different queries, but it is used to help retrieve pages for a query.
The high-level overview of the algorithm is that a page is ranked based on how many links point to it from other web pages and how much weight those links carry. If page A has a rank of .2 and has 4 outgoing links, then each of its outgoing links is worth .05 for whatever page it points to. What this ends up leading to is pages that are pointed to often having high PageRanks! It also means that the outgoing links from such pages carry high weight; so if your personal website is linked to from youtube.com, that will give your page a higher rank than if it is linked to from someone else's small personal site. The algorithm is run iteratively until convergence occurs to a certain extent.
They did not go too in depth on the algorithm in the paper and I think it would have been nice if they gave an example. I have covered this algorithm (actually had to implement it in 485) in other classes and there are some relatively simple examples that illustrate PageRank extremely well and extremely quickly. I think it would have been helpful if the paper included one of these, rather than solely the algebra and pseudocode.
I thought it was interesting to think about the scale of all the links on the web at that time (let alone now) and how that is a major issue for computing PageRank iteratively. The paper says that for around 300 million links it will converge to a reasonable tolerance after around 52 iterations. 52 iterations over 300 million links is still quite a lot of computational power and memory, though. According to the paper, the number of iterations to convergence is linear in log n, where n is the number of links, which is crucial because n is so large.
Lastly, a few interesting observations I found in the paper (useless but interesting):
1.) It was written when Google wasn’t a known search engine (lolz)
2.) It gave a shoutout to Wolverine Access
3.) No author is listed
4.) It can be used for finding things that have high usage but low linkage, so something people might be trying to hide (the paper referenced pornography).
Overall, I think this was a very important paper that was good at describing a high level overview of a crucial topic for information retrieval. I think it could have done a better job providing examples and going more in depth on the technical implementation of PageRank but other than that it was a good paper. It was enjoyable to read!
The PageRank citation algorithm is motivated by the notion of citations on the internet, i.e., documents linking to other documents. Documents linked to by many other nodes are interpreted to be more important, and nodes with many outgoing edges have a potentially high influence on other documents. Each document can be represented as a node in a graph, with an adjacency matrix for the underlying computation between nodes. Thus, a motivation for this paper is connecting users who are browsing the internet (i.e., traveling between nodes) with the most relevant pages they would like to travel to.
They implement a mathematical interpretation of "rank" that weights each web page based on incoming and outgoing nodes. One of the key considerations in this paper was the "random surfer" dilemma: say a user is traveling between pages; if they get stuck in a loop, they will get bored and leave. Thus, the solution is to add an escape factor to account for the probability of "escaping" a node. The model is essentially a Markov chain, and an eigenvector computation (in practice, power iteration) finds the principal eigenvector of the adjacency matrix in order to identify the most essential pages. However, if, say, Google were to keep track of all web pages and build a huge (petabytes-level) matrix, this computation would take days to complete. Originally, the computation was done in batch at regular intervals (e.g., weekly), but there are newer, more efficient methods of working this out.
This algorithm is a clever way to represent and capture importance in a graph of documents. However, there are a couple of drawbacks to the vanilla PageRank citation algorithm: (1) it is slow for large graphs, i.e., for large computations across huge sets of links such as the internet; it is arguably worth it, though, since small updates to the graph do not influence the result very much. (2) The eigenvector representation of a graph, while powerful in its intuition, is not as fully representative as it may seem. It essentially just captures correlations among the links, which may not be suited to more complex tasks of quantifying relationships.
The paper talks about PageRank, a method for rating web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. The motivation behind the paper is that it is difficult to measure the importance of a web page, especially in relation to searching the World Wide Web.
PageRank is a method for computing a ranking for every web page based on the graph of the web (but not the content). The paper then gives a mathematical description of the PageRank algorithm. In short, PageRank exploits the link structure of the web to calculate the importance of a web page based on how many web pages link to it (backlinks) as well as the quality of those backlinks. The intuition in PageRank is that a page has high rank if the sum of the ranks of its backlinks is high. The paper also anticipates the rank sink problem (e.g., when two webpages point only at each other) and how to handle it. Another intuition is the random surfer model: a surfer simply clicks on links at random, but a real surfer is unlikely to stay trapped in a loop of web pages. Another issue is dangling links, links that point to pages with no outgoing links; in this case, PageRank simply removes the links when calculating ranks and puts them back afterwards.

The next section covers the PageRank implementation, where the writers built a complete web crawler (covering 24 million web pages) that builds an index of links as it crawls. It converts each URL into a unique integer to use as an ID, removes the dangling links, and makes an initial assignment of ranks. The next section covers convergence properties, showing that the total difference between iterations decreases as the iterations proceed. Next, the paper shows the implementation of PageRank on a working search engine (Google), testing how the algorithm works in a production environment compared to another search engine (AltaVista); this section also talks a bit about rank merging and common-case issues. Next, the paper discusses personalized PageRank, in which the E value is set to consist entirely of a single web page, resulting in different rank calculations depending on E. There is the possibility of implementing this for commercial interests.
Lastly, the paper highlights the possibility of applying the PageRank algorithm to website traffic estimation, backlink prediction, and user navigation.
The main contribution of this paper is an algorithm for ordering search results by importance. At a time when most search algorithms were based only on the query string, this algorithm emphasizes analyzing the link structure to determine the chance that a web page fulfills the query. It is also helpful that it addresses the "common case" issue.
However, I notice that the paper does not explain much about the initial rank assignments. Since the algorithm makes multiple passes over the repository, it uses a sizable portion of memory, not to mention that the algorithm has to repeat the iteration when putting the dangling links back. How does it calculate ranks when new web pages (and therefore new backlinks) are added? As for the convergence properties, while the paper shows the total difference decreasing as the iterations go on, it does not show the time spent on each iteration (or whether that time shrinks).
The purpose of this paper is to present a novel algorithm for determining the importance of web pages. This algorithm, PageRank, is designed to work well for the common case of search queries and outperforms existing ranking algorithms, which are often designed more for academic citation networks. The authors examine why those importance measures do not transfer well to Web pages of varying quality.
The technical contributions of this paper mainly involve the PageRank algorithm and the theoretical results surrounding it. The authors present the structures and formulas used in the computation of PageRank. They then examine some basic test cases, comparing PageRank to some existing importance algorithms used by websites. They discuss incremental modifications to the PageRank algorithm as they work through the paper, trying to make it the best algorithm for handling the common case of search queries. They also discuss how PageRank can be tuned via the E parameter, which can be based on different "home pages" for determining the rank source. These personalized PageRanks have many applications, including personalized search engines. Another very important technical contribution is that the authors try to make the PageRank algorithm resistant to commercial manipulation. They note that it is (obviously) the goal of large companies to have their web pages come up earlier in search results for certain queries, and in the past companies have been able to exploit ranking algorithms to their benefit. PageRank makes this manipulation very difficult, because the importance of a page is directly related to how many important pages link to it. Finally, they end with other applications of the PageRank algorithm, such as estimating web traffic (since PageRank more or less corresponds to a random web surfer) and predicting the backlinks of web pages.
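The E-based personalization discussed above can be sketched as follows. This is a hypothetical illustration (the graph, the damping factor, and concentrating all of E on a single page are my own choices for the sketch), not the paper's implementation.

```python
def personalized_pagerank(links, home, d=0.85, iters=200):
    """links: page -> list of out-linked pages.
    All of the (1 - d) random-jump mass is directed to `home`,
    i.e. the E vector is concentrated on a single page."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        # E is a single page: the whole jump share lands on `home`
        new = {p: (1.0 - d) if p == home else 0.0 for p in pages}
        for p in pages:
            for q in links[p]:
                new[q] += d * rank[p] / len(links[p])
        rank = new
    return rank

# In a simple 3-page cycle, the chosen home page comes out on top,
# and pages "near" it in link distance rank above pages farther away.
ranks = personalized_pagerank({"home": ["a"], "a": ["b"], "b": ["home"]},
                              home="home")
```

Changing `home` changes the resulting ranking, which is the mechanism behind the personalized search engines the review mentions.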
I think a main strength of this paper is the mathematical motivation presented for the different aspects of the PageRank algorithm. The authors rely on linear algebra, specifically matrices and eigenvalues, to give the reader a more formal notion of why the structures and aggregators for importance they propose make sense. I also enjoyed the discussion of random walks on the graph of the Web as motivation. As a math major, I find these kinds of discussions very satisfying; they legitimize the system.
As far as weaknesses go, I think at times the authors base their claims on a small amount of data, yet make generalizing claims. For example, when discussing the scalability of the PageRank computation based on rate of convergence (in Figure 5), the authors state, based on the two dataset sizes examined, that PageRank will scale quite well to large datasets, with a scaling factor that is linear in log(n). I think making such sweeping claims is a stretch given the two data series they use for results here.
Review: The PageRank Citation Ranking: Bringing Order to the Web
This paper presents a technique called PageRank, a method for rating Web pages in an objective and mechanical manner, as well as its applications in web search and user navigation. The motivation for such a technique is efficient information retrieval, given that the amount and complexity of web information is exploding. To achieve this goal, an objective measurement of how relevant web pages are, and an efficient algorithm to calculate that measurement, are needed; that is what the PageRank algorithm proposes. PageRank sorts web pages by a relevance score defined on the graph of the web.
What I find interesting about this paper is the part that discusses possible manipulations of the algorithm and how the algorithm is, to some degree, immune to them, especially commercial manipulations. Because the algorithm in essence evaluates a web page's relevance based on how many other pages link to it and how important those linking pages are, it comes with some immunity to manipulation, in the sense that getting important websites to link to a site of interest is costly. However, a loophole, as pointed out in the paper, is that if someone is willing to set up a large number of servers and have them all point to one page, that page will get a high relevance score. Another possible hack of the system, which I am not sure about, is to have machines automatically issue a particular query and "click" on a particular result among all the results returned. In Section 5.4 the paper mentions that a site can become popular if a large number of users visit it consistently. Thus, if a machine (possibly with a dynamic IP) can maliciously inflate the number of visits to a garbage site, I am not sure whether this would also manipulate the output of the PageRank algorithm after all.
This paper introduces PageRank, a method for measuring the overall relative importance of a web page by computing the dominant eigenvector of a matrix that captures the link structure of the World Wide Web. Moreover, PageRank can also be used for web traffic estimation, user navigation, backlink prediction, and many other important information retrieval tasks.
This paper describes the PageRank algorithm in detail: a hyperlink from page A to page B is considered a vote for B, and A's rank is split evenly as vote weight among the pages it links to. Every page has forward links and backlinks. The algorithm represents the votes in a link matrix and iterates the rank calculation until it converges. To handle dangling pages, which have incoming links but no outgoing links, the algorithm removes them at the beginning and adds them back into the calculation after convergence. The rate of convergence is governed by the first eigenvalue being sufficiently larger than the second. The PageRanks of the pages are then given by the dominant eigenvector of the matrix.
This paper also introduces Google, a search engine that implements the PageRank algorithm and that I believe is the prototype of the Google we use today. It describes in detail how to calculate PageRank on a memory-limited workstation and presents the impressive results retrieved by the Google search engine.
1. The authors present a novel algorithm that computes the overall relative importance of a web page via the eigenvector of a matrix representing the World Wide Web. They implement the algorithm in a prototype and show impressive results. Moreover, this prototype search engine later became one of the largest Web companies in the world, and the PageRank algorithm helps billions of users search for websites behind the scenes at Google.com.
2. The authors introduce the PageRank algorithm clearly, with examples and equations. They also explain what inspired the design of the algorithm and what the process it simulates (the random surfer) means.
1. The paper uses eigenvectors in several places but does not provide a formal definition of an eigenvector; it would help readers understand the algorithm if the paper gave full definitions of the terminology it uses.
2. PageRank is a great algorithm; however, to keep up with changes to the WWW, periodic re-computation is needed, and this may take more time and computational power than a single workstation can provide, so a further discussion of a distributed PageRank algorithm would be interesting to see.
This paper introduces PageRank, a method for rating web pages objectively and mechanically. Ranking search results in a reasonable way is important so that people can easily find the information they are looking for. PageRank assigns each web page a number reflecting its importance by looking at the importance of the pages that link to it. Instead of looking at a page's content, PageRank uses only information external to the web pages themselves. Backlinks are similar to peer review in the academic publication system. In addition, backlinks carry different weights, determined by the importance of the pages the backlinks come from. The PageRank of a page u is R(u) = c * sum over v in B(u) of R(v)/N(v), where B(u) is the set of pages linking to u (its predecessors) and N(v) is the number of forward links of v. To get the total ranking of all web pages on the WWW, PageRank is computed by first giving a set of web pages initial ranks, then iterating over the whole web graph until the PageRanks converge to stable values.
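The formula R(u) = c * sum of R(v)/N(v) can be seen as a matrix-vector iteration. Below is a tiny worked example with a uniform rank source added so the total rank mass stays at 1; the 3-page graph and the value of c are illustrative choices, not from the paper.

```python
# A[u][v] = 1/N(v) if page v links to page u, else 0.
# Graph: page 0 links to 1; page 1 links to 0; page 2 links to 0 and 1.
A = [
    [0.0, 1.0, 0.5],
    [1.0, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
c = 0.85
R = [1 / 3] * 3  # initial ranks
for _ in range(100):
    # one update: R(u) = c * sum_v A[u][v] * R(v), plus a uniform rank source
    R = [c * sum(A[u][v] * R[v] for v in range(3)) + (1 - c) / 3
         for u in range(3)]
# Page 2 has no backlinks, so it ends up with only the rank-source share.
```

Pages 0 and 1 are in symmetric positions, so they converge to the same rank, while page 2 keeps only the small share contributed by the rank source.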
One contribution of this paper is assigning different importance to different backlinks. Instead of simply counting inbound links, PageRank considers web pages more important if they are pointed to by pages with high PageRank. It is fairly easy to automatically generate an enormous number of links that point to a specific page; however, under PageRank these synthetic links count for little, because the machine-generated pages themselves have low PageRank.
A weakness of PageRank is that it is a mechanical, general method of ranking web pages that takes little user customization into consideration. For example, a user's search history is an important indicator of the user's interests. It would be valuable if PageRank could take the user's interests as an input, thus providing more customized results.