Paper title: Graph structure in the Web

This paper describes a series of experiments on the web viewed as a large graph whose nodes are static pages and whose edges are the hyperlinks among them. The authors present three experiments performed on two AltaVista web crawls from 1999: corroborating the power laws previously observed in the degree distributions of nodes, measuring the distribution of directed and undirected connected components in the web graph, and exploring and analyzing the properties of the distinct components that make up the graph. According to the authors, their study may help understand the intricate link structure of the web, predict its evolution over time, and analyze the behavior of algorithms that run on top of it. Although previous work exists in this area, the authors claim their research is more complete in the sense that it is not restricted to a local (or bounded) portion of the web, such as a corporate website: their experiments run on an immensely large set of 203,546,046 pages with 1,466 million links (later expanded to 271 million pages and 2,130 million links, respectively).

The results show that the in- and out-degree distributions mostly follow power laws with exponents 2.09 and 2.72, respectively, as the authors expected. Moreover, the sizes of the undirected connected components of the web graph also follow a power law. Finally, by using BFS to explore the graph structure, the authors show that the web is not actually fully connected; it is partitioned into five regions: a strongly connected component (SCC); a component whose pages can reach the SCC but cannot be reached from it (IN); a component that the SCC can reach but that cannot reach back (OUT); many TENDRILS, hanging off IN or leading into OUT without passing through the SCC; and various disconnected components. In other words, the web graph looks less like a compact, fully connected unicellular creature and more like a complex organism with limbs.

Paper title: Graph structure in the Web

This paper reports experimental results obtained on data provided by AltaVista. It treats the web as a graph in which pages are nodes and links are edges, runs BFS from random starting points over it, and uses a weakly-connected-components (WCC) algorithm to find the weak components and a strongly-connected-components (SCC) algorithm to find the strongly connected components. Based on these algorithms, the paper divides the web into five regions (SCC, IN, OUT, TENDRILS, and DISC) and estimates the number of pages in each region from samples. It also reports statistics about the web viewed as a graph.

The paper has several strengths. First, it rests its experiments on a very good abstraction: it takes the web as a graph, which is still useful today, and the same web-as-graph view underlies Google's PageRank. Second, it creatively divides the graph structure into several parts; the IN, OUT, SCC, and DISC parts are still applicable if we want to classify all the pages on the internet now.

However, the paper also has some drawbacks. First, the data it uses is biased: according to the paper, the crawl starts from some major resources on the internet, which makes the strongly connected component look larger than it may really be. Second, BFS from random starting points may miss many IN and DISC pages, since the crawler cannot reach them unless they happen to be among the starting points. Third, the IN, OUT, SCC, TENDRILS, and DISC partition is not a very accurate model of today's web. The actual structure of the web is much more complex than those five parts. In addition, there are far fewer IN and OUT pages now; many of the web pages on the internet today are connected with each other. Therefore, even though the graph is a very good abstraction for measuring the structure of the web, we may need more advanced techniques, such as mathematical or graphical modeling, to analyze it.
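To make the forward/backward reachability described in these reviews concrete, here is a minimal Python sketch of the bow-tie partition, assuming a toy directed graph given as adjacency lists and one node already known to lie in the giant SCC. The function and variable names (bowtie, adj, core_node) are illustrative, not the paper's code, and the sketch lumps TENDRILS and DISC together instead of separating them as the paper does.

    from collections import deque

    def bfs(adj, start):
        # Return the set of nodes reachable from `start` along edges in `adj`.
        seen = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        return seen

    def bowtie(adj, core_node):
        # Forward BFS from a core node reaches SCC + OUT; backward BFS on the
        # reversed graph reaches SCC + IN; the SCC is their intersection.
        radj = {}
        for u, vs in adj.items():
            for v in vs:
                radj.setdefault(v, []).append(u)
        forward = bfs(adj, core_node)      # SCC plus OUT
        backward = bfs(radj, core_node)    # SCC plus IN
        scc = forward & backward
        out_part = forward - scc
        in_part = backward - scc
        nodes = set(adj) | {v for vs in adj.values() for v in vs}
        rest = nodes - scc - out_part - in_part   # TENDRILS plus DISC
        return scc, in_part, out_part, rest

    # Toy graph: 1 and 2 form the SCC, 0 can reach it (IN), 3 is only
    # reachable from it (OUT), and 4 is disconnected (DISC).
    adj = {0: [1], 1: [2], 2: [1, 3], 3: [], 4: []}
    print(bowtie(adj, 1))   # ({1, 2}, {0}, {3}, {4})

On the real crawl the paper instead ran BFS from 570 random starting points and estimated region sizes from those samples, but the same forward/backward reachability logic decides which region a page falls into.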
Paper title: Graph structure in the Web

This paper studies the web as a directed graph in which nodes correspond to static pages on the web and arcs correspond to links between pages. This view is very valuable for designing web algorithms for efficient crawling, searching, and community discovery, and also for better understanding sociological phenomena on the web. From this paper we can draw the following conclusions:
1) The distribution of degrees follows a power law: the probability that a node has in-degree i is proportional to 1/i^x, for some x > 1.
2) Both the in- and out-degree distributions on the web conform to power laws; the in-degree exponent is consistently around 2.1, while the out-degree exponent is 2.72 (see the sketch after this review).
3) The study reveals that the bulk of the web graph forms a single connected component if it is treated as an undirected graph. This connectivity is extremely resilient and does not depend on the existence of nodes of high in-degree. The connected web can be divided into four parts: SCC, IN, OUT, and TENDRILS.
4) The SCC is relatively small (about 56 million pages), and the other three sets each contain about 44 million pages. The diameter of the SCC is at least 28, and the diameter of the graph as a whole is over 500.
5) When a source and a destination page are selected at random, the probability that any directed path from the source to the destination exists is only 24%. When a directed path is found its average length is about 16, while the average is about 6 when an undirected path exists. These results are insensitive to the particular crawl.
6) The sizes of SCC, IN, OUT, TENDRILS, and DISCONNECTED can be computed by running BFS twice, in the forward and backward directions, from 570 random starting points.
7) When exploring in-links from a node in OUT, or out-links from a node in IN, we meet approximately 100 million nodes. When exploring out-links from a node in OUT we encounter about 3,094 nodes, and when following in-links from a node in IN we encounter about 171 nodes; OUT tends to have larger neighborhoods.
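As a small illustration of the power-law claim in points 1) and 2), the sketch below draws synthetic samples whose density falls off with exponent 2.1 and recovers that exponent with the standard maximum-likelihood estimator for continuous power laws. Both the synthetic data and the choice of estimator are assumptions for illustration, not the paper's methodology.

    import math
    import random

    # random.paretovariate(a) draws X >= 1 with Pr[X > x] = x**(-a), so its
    # density falls off as x**(-(a + 1)); a = 1.1 therefore gives density
    # exponent 2.1, matching the in-degree law reported in the paper.
    random.seed(42)
    a = 1.1
    samples = [random.paretovariate(a) for _ in range(200_000)]

    # Maximum-likelihood estimate of the exponent x of a continuous power law
    # p(s) ~ s**(-x) on s >= s_min:  x_hat = 1 + n / sum(log(s / s_min)).
    s_min = 1.0
    n = len(samples)
    x_hat = 1.0 + n / sum(math.log(s / s_min) for s in samples)
    print(f"recovered exponent: {x_hat:.2f}")   # close to 2.1

The paper itself read exponents such as 2.1 and 2.72 off log-log plots of the crawl's degree histograms; an MLE fit above a minimum degree, as sketched here, is a common alternative on real data.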