Analysis of the Web Graph Aggregated by Host and Pay-Level Domain
Agostino Funel

TL;DR
This paper analyzes the 2017 Common Crawl web graph aggregated by host and by pay-level domain (PLD). Power-law tails appear at the PLD level for indegree and component sizes but not at the host level, and the paper also estimates the graphs' diameters.
Contribution
It offers a statistical analysis of web graphs with up to 1.3 billion nodes and 5.3 billion arcs, testing degree and component-size distributions for power-law tails and comparing the power-law model against several alternative distributions.
Findings
Power law tails are present in PLD graphs for indegree and component sizes.
No power law tails are observed at the host level.
Estimated diameters of the web graphs are provided.
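As an illustration of the component-size measurements mentioned in the findings, here is a minimal sketch that computes weakly connected component (WCC) sizes with a union-find structure. It is not the paper's tooling: the function name `wcc_sizes` and the assumption that nodes are integers `0..num_nodes-1` are hypothetical choices for this example, and graphs with billions of arcs would need a specialized, memory-efficient framework.

```python
def wcc_sizes(num_nodes, arcs):
    """Return WCC sizes in descending order.

    Assumes nodes are labeled 0..num_nodes-1 and `arcs` is an iterable of
    (source, target) pairs. Arc direction is ignored, which is what makes
    the components "weakly" connected.
    """
    parent = list(range(num_nodes))  # parent[v] == v marks a root
    size = [1] * num_nodes           # size[r] is valid only for roots

    def find(v):
        # Walk up to the root, shortening the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in arcs:
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # already in the same component
        # Union by size: attach the smaller tree under the larger one.
        if size[ru] < size[rv]:
            ru, rv = rv, ru
        parent[rv] = ru
        size[ru] += size[rv]

    roots = (r for r in range(num_nodes) if parent[r] == r)
    return sorted((size[r] for r in roots), reverse=True)
```

For example, `wcc_sizes(6, [(0, 1), (1, 2), (3, 4)])` yields one component of three nodes, one of two, and one isolated node. The resulting list of sizes is the kind of sample whose tail would then be tested for power-law behavior.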
Abstract
In this paper the web is analyzed as a graph aggregated by host and pay-level domain (PLD). The publicly available web graph datasets were released by the Common Crawl Foundation and are based on a web crawl performed from May to July 2017. The host graph has 1.3 billion nodes and 5.3 billion arcs. The PLD graph has 91 million nodes and 1.1 billion arcs. We study the degree distributions and the size distributions of strongly/weakly connected components (SCC/WCC), focusing on power-law detection using statistical methods. The statistical plausibility of the power-law model is compared with that of several alternative distributions. While there is no evidence of power-law tails at the host level, they emerge under PLD aggregation for the indegree, SCC and WCC size distributions. Finally, we analyze distance-related features by studying the cumulative distributions of…
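The abstract does not say which estimator is used for power-law detection. As a hedged illustration only, the sketch below uses the standard maximum-likelihood estimate of a tail exponent for samples at or above a cutoff `xmin`, with the usual `xmin - 0.5` shift as an approximation for integer data such as degrees. The names `fit_tail_exponent` and `sample_power_law` and all parameter values are assumptions made for this example, not taken from the paper.

```python
import math
import random


def fit_tail_exponent(values, xmin, discrete=True):
    """Maximum-likelihood estimate of alpha for a tail p(x) ~ x**(-alpha), x >= xmin.

    Returns (alpha, standard_error, tail_size). With discrete=True the
    xmin - 0.5 shift is applied, a common approximation for integer data.
    """
    tail = [x for x in values if x >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    base = xmin - 0.5 if discrete else xmin
    log_sum = sum(math.log(x / base) for x in tail)
    if log_sum <= 0:
        raise ValueError("tail is degenerate; cannot estimate alpha")
    n = len(tail)
    alpha = 1.0 + n / log_sum
    std_err = (alpha - 1.0) / math.sqrt(n)
    return alpha, std_err, n


def sample_power_law(alpha, xmin, n, rng):
    """Draw n continuous power-law samples by inverse-transform sampling."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    data = sample_power_law(alpha=2.5, xmin=1.0, n=20000, rng=rng)
    alpha, err, n = fit_tail_exponent(data, xmin=1.0, discrete=False)
    print(f"alpha = {alpha:.3f} +/- {err:.3f} from {n} tail samples")
```

A point estimate like this is only a first step. Judging whether a power law is plausible, as the abstract describes, additionally requires choosing `xmin` from the data, a goodness-of-fit test, and likelihood comparisons against alternative distributions.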
