Decoding the structure of the WWW: facts versus sampling biases
M. Angeles Serrano, Ana Maguitman, Marian Boguna, Santo Fortunato, Alessandro Vespignani

TL;DR
This paper analyzes the topological properties of different Web graphs, revealing how sampling biases influence observed network features and identifying reciprocal connections as key indicators of the Web's structure.
Contribution
It provides a detailed statistical analysis of multiple Web graphs, highlighting the impact of sampling biases and proposing reciprocal connections as a discriminating observable.
Findings
Sampling biases significantly affect Web graph properties.
Reciprocal connections capture essential topological information.
Degree correlations are influenced by the sampling process.
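As a rough illustration of what a "reciprocal connection" is, the sketch below counts the fraction of directed hyperlinks whose reverse link also exists. This edge-fraction definition, the function name `reciprocity`, and the toy edge list are assumptions made here for illustration; the paper's own measures may be defined differently.

```python
def reciprocity(edges):
    """Fraction of directed edges (u, v) whose reverse (v, u) is also present.

    Assumption: a simple edge-fraction definition, not necessarily the
    observable used in the paper.
    """
    # Deduplicate edges and drop self-loops, which would count as their own reverse.
    edge_set = {(u, v) for u, v in edges if u != v}
    if not edge_set:
        return 0.0
    reciprocal = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return reciprocal / len(edge_set)


# Hypothetical toy Web graph: pages a, b, c, d and the hyperlinks between them.
toy_edges = [
    ("a", "b"), ("b", "a"),  # a and b link to each other
    ("a", "c"),              # one-way link
    ("c", "d"), ("d", "c"),  # c and d link to each other
    ("b", "d"),              # one-way link
]

r = reciprocity(toy_edges)
print(f"reciprocity = {r:.3f}")
```

Comparing a quantity like this across crawls of different domains is one simple way to probe whether an observed property reflects the Web itself or the way it was sampled.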
Abstract
The understanding of the immense and intricate topological structure of the World Wide Web (WWW) is a major scientific and technological challenge. This has been tackled recently by characterizing the properties of its representative graphs in which vertices and directed edges are identified with web-pages and hyperlinks, respectively. Data gathered in large scale crawls have been analyzed by several groups resulting in a general picture of the WWW that encompasses many of the complex properties typical of rapidly evolving networks. In this paper, we report a detailed statistical analysis of the topological properties of four different WWW graphs obtained with different crawlers. We find that, despite the very large size of the samples, the statistical measures characterizing these graphs differ quantitatively, and in some cases qualitatively, depending on the domain analyzed and the…
Taxonomy
Topics: Web visibility and informetrics · Web Data Mining and Analysis · Advanced Text Analysis Techniques
