Web Archive Analytics
Michael Völske, Janek Bevendorff, Johannes Kiesel, Benno Stein, Maik Fröbe, Matthias Hagen, Martin Potthast

TL;DR
This paper discusses the challenges of web archive analytics, compares the scale of web data to other datasets, and describes infrastructure for processing and analyzing large-scale web archive data from the Internet Archive and Common Crawl.
Contribution
It introduces a comprehensive framework for understanding web archive data and details infrastructure for processing and analyzing large-scale web archives for research.
Findings
Relation of the Global Datasphere to other data sets
Agreement with Internet Archive for data access
Infrastructure for hosting 8 PB of web archive data
Abstract
Web archive analytics is the exploitation of publicly accessible web pages and their evolution for research purposes -- to the extent organizationally possible for researchers. In order to better understand the complexity of this task, the first part of this paper puts the entirety of the world's captured, created, and replicated data (the "Global Datasphere") in relation to other important data sets such as the public internet and its web pages, or what is preserved thereof by the Internet Archive. Recently, the Webis research group, a network of university chairs to which the authors belong, concluded an agreement with the Internet Archive to download a substantial part of its web archive for research purposes. The second part of the paper in hand describes our infrastructure for processing this data treasure: We will eventually host around 8 PB of web archive data from the Internet…
Taxonomy
Topics: Web Data Mining and Analysis · Web visibility and informetrics · Research Data Management Practices
