Review for Paper: 24-Authoritative Sources in a Hyperlinked Environment

Review 1

Text-based search engines had difficulty ranking results for queries on general topics, such as “auto makers,” because some relevant pages do not contain self-descriptive text (e.g., the phrase “auto maker”), and because of the authoritativeness problem: it is difficult to determine from the text of web pages alone how to rank them by authoritativeness (importance). A score-based approach using the link graph over web pages is useful for finding sites related to a topic and ranking them by importance.

In “Authoritative Sources in a Hyperlinked Environment,” Jon Kleinberg proposes a method for using the graph of a subset of the Web to find authoritative pages on a query topic. The approach starts by taking the top results from a text-based search engine on the topic and building the directed link graph over those pages, plus other pages added based on heuristics. Certain pages in the graph are then labeled as “hubs” or “authorities,” where hubs are pages that point to strong authorities, and authorities are pages pointed to by strong hubs. These definitions are circular, like PageRank’s. Also like PageRank, the hub scores and authority scores of nodes in the graph can be computed iteratively, in this case using alternating sweeps over all the hubs, then all the authorities, similar to an expectation-maximization update step. After the updates converge, the top-scoring pages are reported as the topic’s hubs and authorities.
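To make the iteration concrete, here is a minimal sketch of those alternating sweeps (hypothetical Python over a link graph given as adjacency sets; a rough illustration, not Kleinberg's actual implementation):

    # Minimal HITS-style sketch: the graph is {page: set of pages it links to}.
    def hits(out_links, iterations=20):
        pages = set(out_links) | {q for qs in out_links.values() for q in qs}
        in_links = {p: set() for p in pages}
        for p, qs in out_links.items():
            for q in qs:
                in_links[q].add(p)

        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # Authority sweep: sum the hub scores of pages pointing in.
            auth = {p: sum(hub[q] for q in in_links[p]) for p in pages}
            # Hub sweep: sum the (new) authority scores of pages pointed to.
            hub = {p: sum(auth[q] for q in out_links.get(p, ())) for p in pages}
            # Normalize so the scores settle instead of growing without bound.
            for scores in (auth, hub):
                norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
                for p in scores:
                    scores[p] /= norm
        return hub, auth

After convergence, the highest-scoring entries of auth and hub are the pages reported as the topic's authorities and hubs.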

The paper presents a novel way of interpreting the web linkage graph, an approach which is similar to PageRank but different enough to be useful in its own right. In PageRank, all pages are ranked by one value, importance; each page receives value over its in-links and passes its value out equally over its out-links. In Kleinberg's method, there are two quantities, hub score and authority score, which are transferred in opposite directions over each page's in-links and out-links. This allows a page to be a good hub but a weak authority, or vice versa. Authorities and hubs may be useful for different types of search queries, so there is value in learning those labels separately. For example, a search for auto makers might be seeking Honda and Toyota web sites (authorities), or sites with general information on cars (hubs).


Review 2

This paper deals with authoritative sources of information extracted from webpages. Since there is a large amount of information available on the web, it has to be filtered according to the query made. Queries are of three types: specific queries (presenting the scarcity problem), broad-topic queries (exhibiting the abundance problem), and similar-page queries. The author presents an algorithm to address these problems. For example, it is impossible to go through all the links listed by a search engine for a broad search topic; the author's algorithm helps narrow the results down to a relevant few.

A text-based ranking algorithm faces the problem that pages are scarcely self-descriptive. For example, the term “search engine” rarely appears on the pages of search engines themselves. Link-based analysis, on the other hand, does not depend on pages being self-descriptive, but the right balance between relevance and popularity is hard to find. A hub-authority method is used for this, which entails construction of a focused subgraph of the world wide web and computation of hubs and authorities (hub pages link to authorities on a common topic). An iterative algorithm is used for this purpose, which extracts the top authorities and hubs.

The paper is successful in explaining how authoritative sources are determined in a hyperlinked environment. The author presents an efficient heuristic algorithm that computes authority and hub weights. The theoretical background is complemented by numerous variants and applications.

The paper, however, drifts a little from the topic in places. The solution provided is not as robust to perturbations as PageRank, and the query-time computation is inefficient as well.



Review 3

Their work originates in the problem of searching on the www, which we could define roughly as the process of discovering pages that are relevant to a given query, and more specifically in broad-topic queries, which exhibit the abundance problem: the number of pages that could reasonably be returned as relevant is far too large for a human user to digest. To provide effective search methods under these conditions, one needs a way to filter, from among a huge collection of relevant pages, a small set of the most "authoritative" or "definitive" ones.

It is important because it is critical for locating high-quality information related to a broad search topic on the www. Their technique is based on a structural analysis of the link topology surrounding "authoritative" pages on the topic.
1-2 main technical contributions? Describe.
They propose a link-based model for the conferral of authority and show how it leads to a method that consistently identifies relevant, authoritative www pages for broad search topics. The model is based on the relationship that exists between the authorities for a topic and those pages that link to many related authorities; they refer to pages of this latter type as hubs. They observe that a certain natural type of equilibrium exists between hubs and authorities in the graph defined by the link structure, and they exploit this to develop an algorithm that identifies both types of pages simultaneously. The algorithm operates on focused subgraphs of the www that they construct from the output of a text-based www search engine; their technique for constructing such subgraphs is designed to produce small collections of pages likely to contain the most authoritative pages for a given topic.
1-2 weaknesses or open questions? Describe and discuss.
It is challenging to validate their algorithm without a precise definition of "authoritative," and besides the topological structure of the www there are many other aspects to explore, such as traffic patterns and how links form over time.



Review 4

This paper presents a new way to extract information from the link structure of the WWW and discover authoritative www sources. In short, this new technique can be used to identify the most central pages for broad search topics in the context of the www as a whole, and it is achieved through an algorithm that identifies hubs and authorities in a focused subgraph of the WWW.
It first generates a seed set, expands it, and iteratively calculates hub and authority scores until convergence. The paper first discusses the method for constructing a focused subgraph of the WWW with respect to a broad search topic (producing a set of relevant pages rich in candidate authorities). Then, it presents the main algorithm for calculating hubs and authorities. Following that is the related work and extensions. Finally, it provides an evaluation of how to choose a broad topic and the performance of the method.

The problem here is related to searching on the web, and how to discover relevant pages given a search query. It is hard to evaluate the quality of a search because it requires human judgment: a person must look through the results and assess their relevance. However, there is an enormous number of pages, and new pages are generated every second, so it is not realistic for humans to do such measurement at scale; a good automated tool would be helpful.

The major contribution of the paper is that it provides a good method to identify relevant web pages given a query. It uses many figures and examples to introduce new concepts such as hubs and authorities, and it gives several examples of how the method works. The idea is very innovative because it applies graph analysis to the link structure of the web. We have learned a lot of graph theory in mathematics, and it is nice to see some of it applied to web search. However, there is also one weakness: the paper claims the method works well but does not provide solid proof of why it works so well. It might be better to provide some mathematical justification.

One interesting observation: this paper does a good job in general of presenting a new method to produce relevant pages given search queries. This paper focuses on broad-topic queries, and I'm interested in how the method works for specific queries and similar-page queries.


Review 5

The authors address the issue of distilling broad search topics through the discovery of “authoritative” information sources on such topics. In this paper they propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of “hub pages” that join them together in the link structure, under the assumption that hyperlinks encode latent human judgment. This involves the use of eigenvectors of certain matrices associated with the link graph, which also motivate additional heuristics for link-based analysis.
It also discusses the scarcity and abundance problems in search queries and the difficulty of finding an appropriate balance between the criteria of relevance and popularity.

TECHNICALITIES:
The paper addresses many issues using theoretical knowledge and results. Since clustering helps in dissecting a population, the paper describes the construction and use of a focused subgraph of the www with respect to a broad search topic, producing a set of relevant pages rich in candidate authorities by iteratively calculating an authority weight and a hub weight for each page in the subgraph. The authors then discuss the main algorithm for identifying hubs and authorities in such a subgraph, and some of its applications. They address the problem of extracting these authorities from the overall collection of pages purely through an analysis of the link structure, using an iterative algorithm that converges to an eigenvector based on the following rules:
– A good hub points to many good authorities.
– A good authority is pointed to by many good hubs.
– Authorities and hubs have a mutual reinforcement relationship.
In this algorithm, the authority weights correspond to the contribution of the first principal component when the link data is transformed. The algorithm, also known as HITS, is different from clustering via dimensionality reduction and can be implemented with PCA on a limited number of samples. It can also be extended to produce multiple collections of hubs and authorities within a common link structure, i.e., for similar-page queries.
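For readers who prefer the matrix view, a small sketch (assuming A is the adjacency matrix of the focused subgraph, with A[i, j] = 1 when page i links to page j) shows that the iteration is just power iteration: the authority weights converge to the principal eigenvector of A^T A and the hub weights to that of A A^T.

    import numpy as np

    def hits_matrix(A, iterations=50):
        # A[i, j] = 1 if page i links to page j (focused-subgraph adjacency matrix).
        n = A.shape[0]
        hub = np.ones(n)
        auth = np.ones(n)
        for _ in range(iterations):
            auth = A.T @ hub    # authorities collect weight from the hubs pointing to them
            hub = A @ auth      # hubs collect weight from the authorities they point to
            auth /= np.linalg.norm(auth)
            hub /= np.linalg.norm(hub)
        return hub, auth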
The drawback of this approach is that it cannot find all the meanings of a query string when the string has multiple meanings. Hubs for different meanings do not overlap, and only one set of hubs and authorities emerges after the mutually reinforcing iterations.
The authors also draw a comparison between the PageRank and HITS algorithms.

WEAKNESS
The paper surveys the structure and results of three user studies involving the Clever system, whose basic task was automatic resource compilation: the construction of lists of high-quality www pages related to a broad search topic, with the goal of evaluating how the output of Clever compared to that of a manually generated compilation. For each topic, the output of the Clever system was a list of ten pages: its five top hubs and five top authorities. All these pages were collected into a single topic list for each topic in the study, without an indication of which method produced which page. Users were then asked to rank the pages they visited from the topic lists as “bad,” “fair,” “good,” or “fantastic,” in terms of their utility in learning about the topic.
It is challenging to draw concrete conclusions from these studies, as the paper lacks clearly stated assumptions and thorough evaluation results for the experiments conducted.


Review 6

This paper attempts to solve the same problem as PageRank in the previous paper: a text-based search engine will return a bunch of pages, but there is no way to order them. This is known as the abundance problem; the number of pages that could reasonably be returned as relevant is far too large for a human user to digest. However, there are a few critical differences between these two algorithms.

For one, HITS (Hyperlink-Induced Topic Search) avoids the problems that PageRank has with global ordering. For example, if I'm looking for 'trees', and yahoo.com contains that term, then it will always be returned first because it has such a high PageRank. HITS avoids this by working on a local, topic-specific subset of the entire graph. It also allows retrieval when the page does not contain the text we are searching for: yahoo.com, msn, bing, and google do not contain the phrase 'search engine', but they should come up if that's what I search. This is the case for many large sites; IBM's homepage does not contain the term "big blue" or "computer hardware", and Ford/GM/Honda/Toyota's homepages do not contain "automobile manufacturer".

The algorithm is designed to be decoupled from the search engine. The author's system queries AltaVista, a text-based index, to get a list of pages, and then works from there. It cuts down the number of pages to something reasonable, around 200 pages at publication time. It then expands this set based on in-links and out-links, and this 'base set' is what is actually operated on. The algorithm then calculates the 'hubs' and 'authorities' on this focused sub-graph. 'Hubs' are collections of links that point to authorities; good hubs will point to lots of good authorities. 'Authorities' are what actually contain the information; good authorities have lots of good hubs that point to them. The paper proves, in math that is beyond me, that the algorithm converges. In my experience (in eecs485), as well as the author's, the algorithm converges very quickly; more than 50 iterations would be very unusual.
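A rough sketch of that base-set construction step (hypothetical Python; the callables passed in are stand-ins for the AltaVista query and the link-index lookups, and the in-link cap is an assumed parameter):

    def build_base_set(query, text_search, out_links, in_links,
                       root_size=200, in_link_cap=50):
        # Root set: the top text-search hits for the query (about 200 pages
        # at publication time, per the description above).
        root = set(text_search(query, limit=root_size))
        base = set(root)
        for page in root:
            base.update(out_links(page))                     # everything a root page points to
            base.update(list(in_links(page))[:in_link_cap])  # only a capped number of in-linking pages
        return base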

This paper presents solid ideas for a search engine. However, unlike PageRank, which started Google's empire, HITS was never really used in commercial systems (as far as I know, it was only used by one small company that was purchased by ask.com). This is due to a few reasons:
- The cost has to be paid for every query. While the cost per query could be small, at scale it becomes expensive. It also slows down response time.
- The math presented is too complex. The paper is very dense, long, and hard to understand. The ideas in the algorithm were better fleshed out in later publications, compared to the very simple, short, and easy-to-read PageRank paper.
- No search engine. The system is decoupled from a search engine, which means that query performance suffers (transport time), and the system is dependent on other sites (such as AltaVista). Also, the author did not create a large test system like Brin and Page did for PageRank.


Review 7

Problem and Solution:
The problem is that the world wide web contains a lot of information, has great complexity, and grows at a rapid rate, so high-quality searching is quite challenging. The quality of a search method is evaluated by humans and expressed as relevance. The paper shows ways to improve search quality in a hyperlinked environment. Every search is viewed as a query, and the authorities of pages for a query are used to rank the results; they are determined by a link-based model. The problems in analyzing the link structure are that the quality of individual links is uncertain, and that an appropriate balance between the criteria of relevance and popularity is hard to find. The corresponding algorithm constructs a focused subgraph of the web to make the hub and authority calculation feasible and efficient, by growing the root set according to certain conditions. The hubs and authorities are then computed inside the subgraph.

Contribution:
The main contribution of the paper is that it provides a way, hubs and authorities, to rank pages according to search queries. This adds relevance as a measurement criterion to improve the quality of search results, since popularity alone may be affected by useless links or ad pages.

Weakness:
The downside of the algorithm is that the choice of subgraph may greatly affect the result. A small subgraph makes the algorithm untrustworthy, while a large subgraph makes the calculation slow.


Review 8

This paper describes an algorithm for automatically detecting important and relevant websites on the web based on analysis of the links and hyperlinks that are observed in the web. Being able to determine the authoritative websites is important in the context of web search and improving the quality of search engines. A simple link analysis is not sufficient because of the presence of “unreliable” websites, along with the sheer abundance of web pages on the web.

The algorithm proposed by the paper first constructs a subgraph of the world wide web by creating a collection of relevant websites it wants to investigate. Then the algorithm computes the major hubs and authorities within the graph. Hubs are pages that point to many good authorities, while authorities are pages that are pointed to by many good hubs. Each page has a weight for authority and a weight for hub, and we compute the values iteratively until they converge. We then filter out pages that have low weights to determine the set of authoritative sites.
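That final filtering step is simple once the weights have converged; a sketch (with a hypothetical cutoff c) is just:

    def top_pages(scores, c=5):
        # scores: {page: converged authority (or hub) weight}. Keep the c largest.
        return sorted(scores, key=scores.get, reverse=True)[:c]

    # e.g., authorities = top_pages(auth_weights, c=5); hubs = top_pages(hub_weights, c=5)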

The paper presents the algorithm in a very concise manner and points out its weaknesses and areas for improvement. The evaluation is a bit weak, with not much comparison to existing solutions. Also, the paper doesn't discuss the computation time of the algorithm or how long the values might take to converge for different data sizes, which can be important.



Review 9

Problem/Summary:

The internet has an enormous number of pages, and therefore web search is an important and difficult problem. Part of the problem is that, given any query, there are too many pages relevant to the query to show the user; the pages must be filtered somehow. In addition, many pages that are relevant to the query do not contain any of the words of the query. For example, the websites of many car manufacturers probably do not have the words “car manufacturer” on them, and even if they do, they probably don't use the term as often as, say, an online publication about cars. This paper describes how the Hubs and Authorities algorithm solves these problems.

The Hubs and Authorities algorithm models the importance of pages in two ways: A page can be important for a given topic if 1) it links to many important pages about this topic (a Hub) or 2) if it is linked to by many hubs (an Authority). Like PageRank, this is a recursive definition, and the algorithm proceeds in a similar way. Each node is initialized with the same Hub and Authority score, and these scores are allowed to “flow” into connected nodes over successive iterations. In each iteration, each node contributes part of its Authority score to the Hub score of each page pointing to it. Likewise, each node contributes part of its Hub score to each page that it points to. The intuition is that pages that point to pages of high Authority should have a high Hub score, and pages which are pointed to by a Hub should have a high Authority score.

Strengths:

While the math in this paper is a little difficult, the logic of the paper is very understandable and the algorithm is presented in a very intuitive manner. While it is hard to quantify the quality of search results, I like that the authors at least included some sort of quantifiable performance metrics.


Weaknesses/Open Questions:

Is it possible that pages in the web might have more than two types of importance? Could web pages be important 3 or 4 different ways, and could algorithms exploit this?

This paper spends sooooo much time talking about other related work.



Review 10

The paper discusses a technique to find “authoritative” sources among broad search topics. These topics could be the result of a search query from a typical text-based search engine. By “authoritative” sources, we can understand the most relevant pages for a given search query. It is interesting that this paper shares a common problem domain and objectives with the PageRank paper we looked at previously, as it tries to identify “authoritative” sources using only the link structure of pages.

The paper argues that the simplest approach of finding authoritative sources (“authorities”) by finding pages with the most in-bound links is not good: navigational links and paid advertisements are more likely to be returned as authorities than relevant pages. Even if we start with relevant pages, the simplest approach still cannot distinguish between strong authorities and pages that are simply “universally popular.” To overcome this, the author introduces the notion of “hub pages,” pages that link to multiple relevant authoritative pages. Hubs and authorities complement each other, and the paper calls this a “mutually reinforcing relationship”: a good hub points to many good authorities, and a good authority is pointed to by many good hubs. The paper introduces an iterative algorithm to find strong authorities with good hubs.

The intuition and insight behind the notion of “hubs” and its relationship with authoritative pages is the main strength of the paper. It is explained so well and clearly that readers could have the impression it would work even before looking at the results. The actual sample results given in the paper are indeed remarkable. However, it lacks a quantitative evaluation of the algorithm. Mere sample results and a survey of 37 people may not be enough to convince every reader of the effectiveness of the algorithm and to justify using the approach on top of existing search engines. The fact that the algorithm uses the search results returned by an existing search engine as a starting point may also seem problematic, as it means the method relies on the accuracy and relevancy of the results returned by that search engine.

In short, the paper discusses another way of calculating the most relevant pages for a given search query like in the PageRank paper. Its intuitive notion of “hub pages” makes it easy to understand their algorithm and readers can anticipate that the algorithm would actually work. It is true that there is no standard benchmark as the paper mentions, but the paper could have convinced more readers with better quantitative evaluation of their approach.


Review 11

This paper introduces a technique for locating high-quality and relevant information based on a search query. The idea of the paper is to locate authoritative sources or pages relevant to the information in the query and then present the results to the user in descending order based on the number of links going to a result. In order to do this, the technique first uses a text-based search to form a root set, which contains a large number of pages relevant to the query and is a much easier space to search than the entire web. The root set is then expanded to a new set based on pages that point into the root set or that the root set points to. At this point, the set of pages contains all the authoritative pages, but not necessarily all the pages that are thematically relevant. To filter out pages that shouldn't be included, such as advertisements, the paper introduces the notion of hub pages and "goodness," based on the observation that thematically relevant pages should have a large overlap in the pages that point to them. The identification of good hubs and good authorities allows the technique to filter out bad results.

An interesting observation about the technique in the paper is that it only uses the query text to create the initial root set. Afterwards, it never looks at the text again and does all of the processing and result creation based on link structure. It seems counterintuitive to me that the query is involved so little in the actual search. This also seems to present a problem, since the results produced are based on a ranking system that only considers the in-degree. This technique only looks at "what" the query is; however, it may also be important to consider "who" supplied the query. The current ranking system does not seem to have room to consider this, so I think it may be interesting to look at a different ranking system for the results. Not only can lots of information be gleaned from the link information of web pages, as the paper mentions, but also from the context in which the query was performed.


Review 12

This paper introduces some techniques for extracting information from link structures such as the World Wide Web. It was published in 1999 and has even more citations than the PageRank paper. To be more specific, this paper focuses on the "searching" problem: return a list of relevant pages given an input query. The evaluation of this problem involves human judgement. Also, there exists a trade-off between the time spent retrieving the results and the quality of the results.

The link structure can first be formulated as a directed graph, where the nodes are webpages and the edges are links pointing from one webpage to another. This paper classifies input queries into three categories: specific, broad-topic, and similar-page, whose meanings can be inferred from their names.

The proposed algorithm isolates a small region as a "sub-graph" by finding a relevant set of pages given a query. The sub-graph is first constructed by adding some of the most relevant pages, which serve as "root pages," based on text analysis. The other pages are then added to the sub-graph incrementally. The resulting sub-graph is the focus of the subsequent search process, which significantly reduces the complexity compared to searching the whole graph. The proposed algorithm then turns to finding the proper ordering of the relevant pages by investigating the "authorities" (having a large in-degree and significant overlap in the sets of pages that point to them) and the "hubs" (having links to many authorities).

One interesting point I find in this paper is that the "root pages" found by text analysis are not structurally related. Intuitively, pages that are relevant to the query should also be related to each other, but the results do not bear this out. I think of this as a query possibly belonging to several different latent topics. Pages within a latent topic are structurally related, that is, they have shared links. Thus the first step of finding the root pages actually identifies the different latent topics and picks several representative pages for each, and the second step of adding pages to the sub-graph then uses the graph structure to include other pages in these latent topics based on their relations with the representative pages.

This paper was published around the same time as the PageRank paper. This one focuses on narrowing the search space using the sub-graph idea and considers hubs and authorities to find more relevant pages, while the PageRank paper focuses on ranking all the pages based on the graph structure. It would be interesting to see some combination of these two ideas.


Review 13

Motivation for Authoritative Sources in a hyperlinked environment
Search engines need the ability to pull out useful information from a very diverse and large number of pages on the web, and the amount of relevant information keeps growing, making it more and more difficult for individual users to filter through the pages that are candidates for useful information. To this end, this paper proposes an algorithm that calculates the authority of pages and determines the “hub pages” that join them together in a linked structure.

Details about Authoritative sources
Analysis of hyperlinks provides information on the authority of a page, because the number of pages that point to a page can indicate its authority. However, determining authority through hyperlinks has drawbacks, because links can be created for a variety of reasons and may not confer authority. There is also difficulty in finding a balance between relevance and popularity. This paper proposes a link-based model, based on the relationships between authorities and the pages that link to authorities (hubs), to consistently identify relevant and authoritative pages. The implementation starts with constructing a focused subgraph. The hubs and authorities are then calculated with an iterative algorithm, where the result converges to the principal eigenvector, which provides the hubs and authorities for the broad search. The paper found that hub pages link densely to a set of related authorities, resulting in an equilibrium between hubs and authorities that recurs across many categories of topics on the internet.

Strengths of the paper
I am taking EECS 551 (Signal Processing and Matrix Analysis) concurrently with EECS 584, and we had just talked about the hyperlink matrix and the information gained from the principal eigenvector in EECS 551. I enjoyed learning about the hyperlink matrix in greater detail from this paper, and additionally learning what information about hubs can be gained from the principal eigenvector.
Weaknesses of the paper
I would have liked to see the paper discuss more about situations where this method for calculating authoritative sources fails or is inadequate. Also, the paper mentions bibliometrics and says that bibliometric impact measures fall short. However, I would have liked to see the paper defend more thoroughly why pages cited in bibliographies are not given more authority.



Review 14

Authoritative Sources in a Hyperlinked Environment is an insightful paper on information retrieval by Jon Kleinberg. The paper discusses how to identify pages that represent authorities in a network of web pages. He motivates his methods by the scarcity and abundance problems and the rapid growth of the Internet in the late 90s. His paper discusses the identification of authorities and hubs for the purpose of retrieving information relevant to a user's query. He discusses the mathematical formulation of this problem and its applications. Near the end of his paper he discusses when it might not be appropriate to use these methods, as well as ongoing and future work.

This paper is strong in that it takes simple graph-theoretic ideas and applies them to an important problem of the time in a novel way. The idea is well motivated, effectively introduced, and provides a strong connection to concepts in linear algebra. The author describes how the content of pages is related and how this can be inferred from the structure of subgraphs of the Internet. This is fascinating to me and represents an implicit semantic relationship between the concepts contained in the unstructured pages of the web. This sort of analysis has led to other important applications of graph-based analysis in natural language processing.

The evaluation of this paper is weak. The author acknowledges that evaluation is difficult to satisfactorily achieve in this domain. It relies on human judgments, which are not always a very reliable way to measure results. The paper states that the original conference version did not even contain this evaluation, just the list of example queries throughout the paper. Humans are often not consistent, or even self-consistent, in evaluating tasks like this when asked explicitly to perform the same task. The best of the user studies had people rate sites in one of four distinct categories describing the quality of the site for learning. It would have been better to have them answer questions about the material they were supposed to learn about, to measure people's time using a page, or to see which pages people prefer for tasks other than learning about a particular subject. The web is a diverse place, and people use these pages for a wider variety of tasks than those assumed in the user studies.


Review 15

Part 1: overview
This paper presents a new algorithm to analyze web pages' hub and authority values. There are two types of important pages on the internet. Authority pages include government web pages, big database vendors' homepages, university home pages, and so on; these pages are authoritative because they are provided by trusted companies, schools, or other authorities. The other type of important page is called a hub, which is basically a collection of links to those authority pages. The web is represented as a link-based graph in this paper. To calculate the authority and hub scores of a given page, they use iteration, just like PageRank. The authority and hub scores converge according to the power method from linear algebra.

part 2: contributions
1, This paper explores the web as a link-based graph and uses link analysis to handle page-evaluation problems, which was a breakthrough at the time. Web pages are hypertexts including links, so graph analysis can be useful. This paper actually came out before PageRank, in a sense pioneering link-based graph analysis of the web.

2, The algorithm for calculating authorities and hubs takes advantage of power iteration, just as the PageRank algorithm does, which is on the right track. This kind of algorithm can be highly parallelized and thus performs well in big-data situations.

3, Although the data size is limited, it can catch good hubs and present them as well, whereas in the PageRank algorithm we only care about backlinks. This is often useful when doing a small-scale personalized search.

Part 3: drawbacks
1, Authorities and hubs need to be calculated in real time. When a query arrives, the system starts to calculate the authority and hub scores of the relevant pages and finds the most authoritative or collective pages, which means a lot of calculation must be performed at query time.

2, The algorithm could easily be cheated by creating a good hub that at the same time links to advertisements which should not be presented to users.

3, The suitable data size or page count is typically small for this algorithm, which is not useful for search at scale. Nowadays, as big data thrives in the industry, we often need algorithms that are highly scalable.



Review 16

The paper discusses a technique for finding high-quality information related to a search topic on the internet. In particular, the paper discusses finding “authoritative” and “hub” pages based on a structural analysis of the link topology surrounding each page. Whether a particular page is an authoritative page for a search topic is determined by the number of links pointing to it, and whether a particular page is a “hub” page for a search topic is determined by the number of links going out of it. Both pieces of information are important for filtering the otherwise millions of pages returned by search engines for a query.

The paper also discusses efficient heuristic algorithms to avoid pathological results while computing the authority and hub weights of a particular page. For example, suppose a large number of pages from a single domain all point to a single page p; quite often this corresponds to a mass endorsement, advertisement, or some other type of “collusion” among the referring pages, e.g. the phrase “This site designed by . . . ” and a corresponding link at the bottom of each page in a given domain. To eliminate this phenomenon, a heuristic can be designed to only allow up to m pages from a single domain to point to any given page p.
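A sketch of how such a cap might be applied before the weight computation (hypothetical Python; extracting the domain with urllib is an assumption for illustration, not the paper's code):

    from collections import defaultdict
    from urllib.parse import urlparse

    def cap_in_links_per_domain(in_links, m=4):
        # in_links: {page: list of URLs linking to it}. Keep at most m linking
        # pages per source domain, so intra-site "collusion" links cannot
        # dominate a page's authority weight.
        capped = {}
        for page, sources in in_links.items():
            per_domain = defaultdict(list)
            for src in sources:
                per_domain[urlparse(src).netloc].append(src)
            capped[page] = [s for group in per_domain.values() for s in group[:m]]
        return capped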

The main strength of the paper is that it provides query-driven, dynamic page ranking, as opposed to PageRank, which statically computes the ranking for all webpages before the query. This allows results better suited to the current query. It also provides solid theoretical techniques to address the important problem of extracting high-quality information.

The main limitation is that it might be inefficient at run time, as it has to compute the “hub” and “authoritative” pages for each query. In addition, the technique was evaluated when the number of web pages on the WWW was in the millions, not billions as on the current web. As a result, it would be nice to evaluate the technique on a modern, huge data set with a high-performance distributed computing infrastructure and compare it against state-of-the-art page ranking techniques. Nonetheless, I found the paper seminal in addressing these important issues, laying the groundwork for current solutions.




Review 17

===overview===
In this paper, a set of algorithmic tools for extracting information from the link structures of a hyperlinked environment is introduced and demonstrated with experiment results. The tools focus on the use of links for analyzing the collection of pages relevant to a broad search topic and discovering the most "authoritative" pages on such topics.

The key problems are:
1. from a global view, the net is disorganized, even though it might be well organized locally
2. we don't have human notions of quality in ranking the pages.

The search engine also takes into account some clues from the users, such as bookmarks and front page, so that the search results are different but are reasonable to each user. Some other optimization considerations are also mentioned in the paper.
1. fast convergence to a stable result, which makes it scalable
2. combination of title search and PageRank increases efficiency and performance.

===strength===
This paper uses a lot of graphs and tables to show the results of the algorithm. It explains the intuition and key ideas behind the algorithm at the beginning, which makes it easy to follow. As you may all have noticed, they use the wolverine access example, which suddenly caught my full attention! I like it!

===weakness===
Even though it mentions the scalability problem a little bit, there is not enough information about how the growth of the number of webpages might challenge the algorithm. As the number of web pages grows, is it still possible to do a complete iteration? As far as I know, this paper doesn't provide enough discussion of these points.




Review 18

This paper is highly influential, especially considering that it is referenced in the PageRank paper as well. It describes the idea of modeling the results using authorities and hubs based on the search query entered.

The author addresses the problem that a specific query might suffer from the scarcity problem, where there are not enough pages that contain the required information, whereas a general broad-topic query might suffer from the abundance problem, where there are too many pages relevant to the search query. The author also addresses a very significant problem: the most relevant page for a given search term may not contain that term on its page, i.e., Volkswagen probably would not have the term "automobile manufacturer" on its page even though it most definitely is one.

This is where the author introduces the idea of hubs and authority. A page is defined as an authority for a given search query if it contains valuable information about the subject. The second type of pages that are relevant to a query are pages that contain useful links pointing towards these authoritative pages, known as hubs. The authority is considered better if it has more hubs pointing to it and a hub is considered good if it points to many authoritative links.

The HITS algorithm identifies these pages based on the authority and hub weights described above. When a query is issued, the top 200 documents containing the highest occurrence of the query terms are selected, and this root set is extended by including pages pointed to by, and pointing to, nodes in the root set. This extended set, called the seed, is highly likely to contain the most important pages. Using these links, an adjacency matrix is constructed and hub and authority weights are computed in order to order the pages by relevance.

One of the major advantages of this algorithm is that it is highly likely to return a highly relevant set of links in response to a given query.

The major difference between the HITS algorithm and PageRank is that PageRank assigns a rank to all the pages on the WWW, whereas HITS operates only on the subset based on the specified query. Since the results are calculated at run time for every search query, HITS can be very slow. It also has the problem that the subset might contain a node that has a high authority score for an unrelated topic.


Review 19

Authoritative Sources in a Hyperlinked Environment

In this paper, a new approach for determining the authoritative sources in a hyperlinked environment is proposed. This new technique exploits the link topology to filter out the high-quality information relevant to a certain broad search topic.

First, we retrieve the root set R of relevant pages by taking the first t results from a text-based search engine. We then add all pages that each page R(i) points to, and up to d pages that point to R(i), to form the base set S for the query.

Since the relevant content on the web associated with each query can be so rich, there is a need to use information beyond plain-text analysis to decide the degree of relevance. The novel idea proposed in this paper is to use the structure of the hyperlinks in the web graph to help determine relative importance. The guiding idea is to use the new concept of a hub to help expose authorities. The author describes the mutually reinforcing relationship as:

a good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs.

Hence the update formulas are:

    authority weight x_p (initialized to 1):  x_p <- SUM{ y_q : there is a link from q to p }
    hub weight y_p (initialized to 1):        y_p <- SUM{ x_q : there is a link from p to q }

The iterative algorithm runs these updates k times (normalizing after each pass) and returns the largest c authorities and hubs.

To sum up, the strength of this paper is that it provides a new way of determining the top relevant authorities for a broad search topic. Judging by the few sample test results, the quality of the ranked results is quite close to the real authoritative pages for the test topics. The algorithm also makes good use of a plain-text search engine to shrink the base set to a constant, small size, which ensures a reasonably fast run time for each query. In addition, the paper contains many graphs that help the reader understand the topology of web pages and how the algorithm makes use of it.

However, there are also some weaknesses in this paper:

First of all, in the performance test part, not enough tests are conducted; only a few author-specified tests are provided to illustrate the efficiency and quality of the authority search. What readers may be most concerned about is how this algorithm performs on a set of randomly selected topics. Also, since relevance itself is not a trivial metric for judging search results, the author should define a better evaluation metric for his algorithm.

Secondly, this paper only provides an algorithm to respond to a single query, and few implementation details are addressed. As is well acknowledged, web searching has to scale to many users. If for each user query the search system has to retrieve the base set S and run the iteration on it, then depending on the parameters k, c, t, and d, the system load may be very large. How to scale such searches to many users and how to optimize system latency may be worth exploring for the author.







Review 20

In this paper, the authors describe a system to compute authoritative sources of information on the internet. Unlike PageRank, this paper makes a large number of assumptions about the underlying structure of the web, namely that it is formed of hubs and authoritative sources.

This paper had a large number of figures which helped explain the concepts. I also like that this paper made numerous references to the World Wide Web. However, I thought that the PageRank paper was stronger than this given that it made fewer assumptions about the environment.


Review 21

This paper explains how we can find the authoritative pages for a search query in a hyperlinked environment. The problem with many queries that search engines get is that there are multiple types of queries and just finding pages that include the text of the search query does not give us a “relevant” set. Search queries can be broken down into two types: specific queries and broad topic queries. With specific queries, it is hard to find web pages that will answer the exact query and with broad topic queries, it is hard to find the web pages that are actually relevant.

To solve these problems, we can just look at a subgraph of the entire web. For any search query a, we want to find a subset that is relatively small, is rich in relevant pages and contains many of the authorities of that topic. To get this list, we first find the collection of t highest ranked pages for that query. This list will get us a relatively small subset of the internet that is rich in relevant pages, but it will not guarantee the top authorities. This is where we can expand the current set to contain more pages that are stronger authorities.

Finally, because we want authoritative pages that are also relevant to the current search query, we will want to find hubs that point to this authoritative source to weed out the pages that are just universally popular. A page can have many other pages in the root pointing to it, but the page could just be popular among all topics, not just the one that the user is searching for.

Overall, this paper introduces an algorithm that can extend current search engines to create more accurate results. However, I have some concerns about the paper:

1. There is additional work that has to be done in order to get this new set of results, but there was never any analysis about how long this would take. I would have liked to see the tradeoff between how accurate current search engines already are and the time needed to complete the additional work.



Review 22

This paper discusses how to identify the most relevant search results for a given query from among all the pages on the web. Two of the main problems addressed in this paper are the problems of scarcity and abundance. The problem of scarcity is that relatively few pages, including some of the most important pages on a given topic, may contain the given query terms. The problem of abundance represents the other side of the coin, arising in cases where the number of pages that could be considered relevant is far too large for users to sort through. This paper uses the number of links into and out of web pages to define mutually reinforcing authorities, pages that are most relevant to a given topic, and hubs, pages that contain links to thematically related authorities.

The authors propose an algorithm in which a small search space of results is generated by first gathering the results returned by a text-based search engine such as AltaVista. Then, this pool of pages is expanded to include all pages linked to by the set of pages in the pool, and up to some fixed limit of pages that link to the pages in the core set. Then, a directed graph is drawn between all the pages in the network. The authors observed that, when relevance was calculated by using the number of links into a page, the most relevant results were typically near the top of the rankings, but other pages that were very popular but unrelated to the query were also near the top.

The authors noticed that the most relevant results were typically part of densely connected networks of hubs, which point to related authorities, and authorities, which are pointed to by many hubs. In order to identify hubs and authorities, the authors developed an algorithm that iteratively assigns a hub score to a page based on the authority scores of the pages it points to, and assigns an authority score to a page based on the hub scores of the pages that point to it. Linear algebra can then be used to show that the scores converge to a principal eigenvector, whose largest coordinates identify the most relevant results under these two scores.

My chief concern with this paper is that it spends a great deal of time delving into the complex linear algebra involved in calculating the authority and hub scores of pages, but little time demonstrating its performance against existing search engines. I also would have liked to see more development of the related papers they mention near the end in which other groups applied this technique (or variations of it) in a way that had direct access to the web, rather than relying on a third-party search engine to generate the initial pool of results.



Review 23

The paper developed a set of algorithmic tools for extracting information from the link structures of web environments, and reports on experiments that demonstrate their effectiveness in various contexts on the World Wide Web. The challenge of this work is that a hyperlinked environment can be heterogeneous in content; thus, the paper finds the problems of search and structural analysis particularly compelling in this context.

First, the paper introduces the research setting, including queries and authoritative sources and the analysis of the link structure. The research is based on viewing the collection of hyperlinked pages as a directed graph. The authors then compute hubs and authorities on this directed graph.

The strength of this paper is that it describes its algorithms in detail, which is convincing to readers. On the other hand, the weakness of the paper is that it does not provide many real-life examples to make the topic easier to understand.



Review 24

This paper is an introduction to the Hubs and Authorities algorithm used for information retrieval and search queries. The problem they are solving is that if you search for “Harvard,” you likely want “Harvard.edu” returned, not one of the millions of pages that merely contain the word Harvard. To solve this, they take the approach of returning whatever is the highest-ranking hub/authority for the query.

The Hubs and Authorities algorithm is an iterative algorithm to find the best hubs and authorities on the web. This is a graph-based problem where the connections in the graph are links from one page to another. A good hub is described as a page that links to many good authorities, and a good authority is a page that is pointed to by many good hubs. This is computed by taking the sum of all the scores from the incoming links, normalizing, and then repeating. The paper goes into a much more in-depth mathematical proof of the algorithm, but the last sentence is the high-level overview.

The next few sections of the paper cover different scenarios and how the hub and authority scores can be slightly tweaked in those scenarios to return better pages, but I don't think there was anything in them of major importance. Once you understand the hubs and authorities algorithm laid out in section three, the other sections are relatively simple.

In terms of evaluation, they showed what example searches would have returned multiple times throughout the paper. Because of this, even though they didn't have any graphics for evaluation, I think the evaluation was sufficient, because the most important criterion is what is returned, and they demonstrated that throughout the paper.

One weakness I saw in this paper was the lack of numbers behind the examples. I think it would have been very helpful to make a simple three- or four-node example and show the computations of the hub and authority scores for each node, rather than trying to explain it all with algebra.
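Something as small as the following (a hypothetical three-page graph, with one update step worked out by hand) would have made the mechanics concrete:

    # Pages 1 and 2 both link to page 3; page 1 also links to page 2.
    out_links = {1: {2, 3}, 2: {3}, 3: set()}
    hub = {1: 1.0, 2: 1.0, 3: 1.0}
    auth = {1: 1.0, 2: 1.0, 3: 1.0}

    # Authority update (sum of hub scores of in-linking pages):
    #   auth[1] = 0, auth[2] = hub[1] = 1, auth[3] = hub[1] + hub[2] = 2
    auth = {1: 0.0, 2: 1.0, 3: 2.0}

    # Hub update (sum of new authority scores of pages linked to):
    #   hub[1] = auth[2] + auth[3] = 3, hub[2] = auth[3] = 2, hub[3] = 0
    hub = {1: 3.0, 2: 2.0, 3: 0.0}

    # After normalizing and repeating, page 3 emerges as the top authority
    # and page 1 as the top hub.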

Overall I think this was a solid paper! I think it could have done a better job of using examples with numbers to compute the hub and authority scores, but that is because I love numbers and am not a big fan of algebraic proofs. Also, I think too much time was spent talking about other work in the field, but that has its merits as well.

Random thing I found funny in this paper:
They had a totally different idea of what a “Social Network” was when this paper was written.



Review 25

This paper presents a similar idea to the PageRank citation paper; it aims to propose a method for determining authority among documents in a connected network (i.e. the internet), as well as identifying documents that tend to link such authoritative documents together. Text-based scoring (e.g. tf-idf) can be interesting, but it is subject to skew from spam documents and may not capture the notion we want. In addition, pages tend not to be explicitly self-descriptive (i.e. my personal webpage will not say “[NAME REDACTED]” everywhere, unless I spoke about myself in the third person. That’s right, me, [NAME REDACTED]).

Unlike PageRank, this algorithm tries to decouple relevance from popularity (heavily linked sites are not necessarily authoritative). The author proposes an iterative algorithm that analyzes the topology of the graph in terms of “connectors” (hub pages) and the link structure around them. The key points of the algorithm are that (1) it offers an efficient heuristic for determining authority as well as connectivity, (2) it is query-driven, i.e. results are determined by user input, and (3) it is flexible enough to be applied to several variant applications.

However, there are several limitations that need to be addressed, perhaps with some modification to the algorithm or to the model paradigm. For example, the rankings are sensitive to variation in link structure: small changes in the links can change the results. Furthermore, the authority rankings may shift depending on the semantic structure of the links, leading to topic drift. In addition, the algorithm appears slow to run, making it inefficient to execute online at query time. It also does not incorporate user behavior (i.e. the linkage between user behavior and output), making it less robust for agent modeling.


Review 26

The paper proposes an algorithm for searching the World Wide Web (WWW) that focuses on using links to analyze the collection of pages relevant to a broad search topic and to discover the most authoritative pages on that topic. The motivation is that we lack objective functions that are both concretely defined and correspond to human notions of quality. How does a machine know whether a page is relevant to the user’s objective, when the mere fact that a page contains the query string guarantees nothing?

The early parts of the paper focus on specific queries (“Does X support Y?”) and broad-topic queries (“Find information about Z”). For specific queries, the difficulty is finding the few pages that satisfy them (the scarcity problem), while for broad-topic queries, it is determining which pages are relevant among an abundance of pages (the abundance problem). The author proposes analyzing the link structure. The model is based on the relationship between authorities on a topic and the pages that link to many related authorities (hubs), and the analysis is global (treating the WWW as a whole, not a single page at a time).

The next sections explain how the algorithm works. First, it constructs a focused subgraph of the WWW. It begins by taking the highest-ranked pages for the query string from a text-based search engine (the root set). The root set is then expanded by adding pages that link to it or are linked from it, producing the base set. The algorithm then computes hubs and authorities over this base set, assigning and iteratively updating a hub weight and an authority weight for each page.

Next, the paper extends the algorithm to similar-page queries (“Find pages similar to W”). It captures “similarity” by asking, “in the local region of the link structure near page W, what are the strongest authorities?” Such authorities can serve as a broad-topic summary of pages related to W. The next section discusses connections with related work through the notions of standing (importance), impact, and influence, drawing examples from social networks, scientific citation (bibliometrics), hypertext and WWW rankings, and other link-based approaches. It also talks about clustering of the link structure. The following section addresses the issue of multiple sets of hubs and authorities, which typically arises when a query string has more than one interpretation (e.g. ‘apple’ can refer to Apple computers, Apple Records, or the fruit). Interestingly, the algorithm implicitly groups pages whose interpretations are correlated, as derived from their link structure (e.g. a page about Apple computer products usually links to other pages about Apple computers, not about Apple Records).
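
As a side note on how the similar-page variant gets started: instead of seeding the root set from a text-based search engine, it is seeded from pages that point to W. A minimal sketch, where inlinks(url, limit) is a hypothetical link-index lookup rather than an API from the paper:

# Seed the similar-page variant from pages pointing to W; the usual
# base-set expansion and hub/authority iteration then run unchanged.
# `inlinks` is a hypothetical helper returning pages that link to a URL.
def similar_page_root_set(page_w, inlinks, t=200):
    return set(inlinks(page_w, limit=t))

The strongest authorities found in the resulting neighborhood then serve as the “pages similar to W.”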

The last two sections talk about diffusion and generalization, and about evaluation. Basically, the focused subgraph construction aims to ensure that the most relevant collection of pages is also the densest one, and hence will be found by the method of iterated I and O operations. The paper also addresses topics so narrow that it is difficult to find related pages, in which case the algorithm tends to return more broadly related pages. The evaluation section discusses how to evaluate the algorithm (and how difficult that is to do).

One of the contributions of this paper is a set of algorithmic tools for extracting information from the link structure of a hyperlinked environment (here, the WWW), distilling broad search topics by discovering authoritative information sources on those topics. This is done without directly maintaining an index of the WWW and its link structure.

However, I get the impression that the graph is built every time a search is performed, since no index is maintained. Would it be possible to save previous query strings, analyze which queries lead to similar results, and then use that analysis for future searches? Another point, from reading the evaluation section, is that the performance of this algorithm seems to depend on the search engine used to collect the root (basic) set.



Review 27

The purpose of this paper is to provide a new algorithm for determining which pages are good results to return for a broad-topic search query. The paper couches its examples in the World Wide Web, but states that there are applications in other graph-based domains as well. An additional purpose of this paper is to quantify what “authority” means for a web page, which is impressive since authority is such a nebulous concept.

The technical contributions of this paper are numerous. It presents a method for determining the authority of web pages for given search results: an algorithm that first determines a relevant subgraph (thinking of the WWW as a large graph whose edges correspond to links) based on global, not local, structure, then grows that subgraph to include pages linked to by (and linking to) the pages it contains, and finally determines the authority of the pages in this graph based on their incoming links and the quality of those linkers. This is done through an iterative algorithm that maintains both an authority weight and a hub weight for each page to determine whether it is one or the other. The paper provides pseudocode for these algorithms as well as linear-algebra-based motivation in terms of matrices and eigenvalues, along with proofs of correctness and convergence.
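
For readers who want that linear-algebra view spelled out, here is a short sketch in standard notation (the symbols below follow common presentations of the method rather than quoting the paper). Let $A$ be the adjacency matrix of the focused subgraph, with $A_{ij} = 1$ when page $i$ links to page $j$, and collect the authority and hub weights into vectors $a$ and $h$. One round of updates is
\[
a \leftarrow A^{T} h, \qquad h \leftarrow A a ,
\]
followed by normalization. Substituting one update into the other gives
\[
a \leftarrow A^{T} A \, a, \qquad h \leftarrow A A^{T} h ,
\]
so by the usual power-iteration argument $a$ converges to the principal eigenvector of $A^{T}A$ and $h$ to that of $A A^{T}$, which is the convergence result the paper establishes.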

I think the strengths of this paper are numerous. It does an excellent job discussing problems with existing methods and the complications that must be addressed in its own algorithms. Because of these detailed discussions, along with the proofs and the linear-algebra-based examination, the algorithms carry great weight in their legitimacy. Additionally, the section discussing which queries the algorithm performs best on (i.e. how broad a topic has to be for the algorithm to perform well) provides an even more in-depth analysis of the usefulness of the results.

As far as weaknesses go, I think that this paper has few, if any. If I had to pick one, I would say that the positioning of the related-work section is a bit strange. Though I think the author motivates the problem well in the introduction, the related-work section could have been moved earlier in the paper to further motivate how this idea differs from or resembles existing techniques.



Review 28

This paper discusses a set of tools for discovering authoritative sources for broad-topic searches. The paper shows that it is possible to locate high-quality information by investigating the link topology alone. This technique is particularly useful in the context of the WWW. The main application is broad-topic search, and it can also be applied to similar-page queries.

The algorithm is composed of two steps. The first step filters the results of a full-text search: it picks the t highest-ranked pages for the query and grows the set by adding pages that are pointed to by, or point to, pages in the set, with the restriction that a single page can bring in at most d pages pointing to it. After the subgraph G is constructed, the second step finds the hubs and authorities. The author observed that some pages act like an index, pointing to many pages on the same topic (hubs), while other pages are pointed to by many hubs and hold the authoritative information on a particular topic (authorities). The algorithm iteratively walks over the subgraph, setting each page’s authority value to the sum of its in-neighbors’ hub values and each page’s hub value to the sum of its out-neighbors’ authority values, until the values converge. At the end, the pages are ranked by their authority value.
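
A minimal sketch of the first (subgraph-construction) step as described above. The text_search, outlinks, and inlinks helpers are hypothetical placeholders for a text-based engine and a link index; t and d mirror the constants named in the review.

# Build the base set: start from the t highest-ranked text-search results
# (the root set), then add pages each root page points to and at most d
# pages pointing to each root page. All three helpers are assumed, not
# APIs from the paper.
def build_base_set(query, text_search, outlinks, inlinks, t=200, d=50):
    root = set(text_search(query, limit=t))
    base = set(root)
    for page in root:
        base.update(outlinks(page))
        base.update(inlinks(page, limit=d))
    return base

The second step then runs the hub/authority iteration over the link graph restricted to this base set and ranks pages by the converged authority values.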

The main contribution of this paper is to separate web pages into hubs and authorities and to assign a different type of rank based on each role. This approach is better than simple keyword-based ranking, as some authoritative pages don’t contain the keyword at all. For example, honda.com doesn’t have “automobile manufacturer” on it.

One weakness of this approach is that it only fits the category of broad-topic search. However, when a user inputs a search, how a search engine can detect that it is a broad search remains an open question. In my opinion it is quite hard to tell whether a search is a specific question or a broad topic, and the author doesn’t seem to give an answer either.