ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web Graph
Ahmed AlSum, Michael L. Nelson

TL;DR
ArcLink is a system that enhances web archiving by optimizing the construction, storage, and retrieval of the temporal web graph, enabling richer metadata access and supporting applications like PageRank and link analysis.
Contribution
The paper introduces optimization techniques for building and accessing the temporal web graph, and extends web archive interfaces with metadata retrieval capabilities.
Findings
Improved efficiency in web graph construction and storage.
Enhanced API support for metadata retrieval like inlinks and outlinks.
Demonstrated applications such as PageRank computation.
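To make the inlink/outlink and PageRank findings concrete, here is a minimal sketch of a temporal web graph queried by time window and then ranked. It is not ArcLink's implementation: the edge tuple layout, the sample URIs, and the function names `snapshot` and `pagerank` are assumptions made for illustration, and the ranking is the textbook power-iteration form of PageRank.

```python
from collections import defaultdict

# Hypothetical edge records: (source URI, target URI, capture timestamp).
# Timestamps use the 14-digit YYYYMMDDhhmmss form, which sorts correctly as text.
EDGES = [
    ("a.example", "b.example", "20100101000000"),
    ("a.example", "c.example", "20100101000000"),
    ("b.example", "c.example", "20100601000000"),
    ("c.example", "a.example", "20110101000000"),
]


def snapshot(edges, start, end):
    """Return (outlinks, inlinks) for edges captured within [start, end]."""
    outlinks, inlinks = defaultdict(set), defaultdict(set)
    for src, dst, ts in edges:
        if start <= ts <= end:
            outlinks[src].add(dst)
            inlinks[dst].add(src)
    return outlinks, inlinks


def pagerank(outlinks, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over an outlinks mapping."""
    nodes = set(outlinks)
    for targets in outlinks.values():
        nodes |= targets
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages with no outlinks is spread evenly over all pages.
        dangling = sum(rank[v] for v in nodes if not outlinks.get(v))
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            targets = outlinks.get(src)
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    out_2010, in_2010 = snapshot(EDGES, "20100101000000", "20101231235959")
    print(sorted(in_2010["c.example"]))
    print(pagerank(out_2010))
```

Restricting the edge set to a time window before ranking is what makes the graph temporal: the same URI can have a different rank in each window.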
Abstract
Archiving the web is socially and culturally critical, but presents problems of scale. The Internet Archive's Wayback Machine can replay captured web pages as they existed at a certain point in time, but it has limited ability to provide extensive content and structural metadata about the web graph. While the live web has developed a rich ecosystem of APIs to facilitate web applications (e.g., APIs from Google and Twitter), the web archiving community has not yet broadly implemented this level of access. We present ArcLink, a proof-of-concept system that complements open source Wayback Machine installations by optimizing the construction, storage, and access to the temporal web graph. We divide the web graph construction into four stages (filtering, extraction, storage, and access) and explore optimization for each stage. ArcLink extends the current Web archive interfaces to return…
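The four stages named in the abstract (filtering, extraction, storage, and access) can be sketched as a small pipeline. The abstract does not say what each stage does internally, so everything below is an assumption for illustration: the capture dictionary fields (`uri`, `timestamp`, `status`, `mime`, `body`), the filter rule (keep only HTTP 200 HTML captures), the in-memory dictionary store, and all function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute targets of <a href="..."> tags."""

    def __init__(self, base_uri):
        super().__init__()
        self.base_uri = base_uri
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(urljoin(self.base_uri, href))


def keep(capture):
    """Filtering (assumed rule): keep successful HTML captures only."""
    return capture["status"] == 200 and capture["mime"].startswith("text/html")


def extract(capture):
    """Extraction: parse the capture body and return its outlinks."""
    parser = LinkExtractor(capture["uri"])
    parser.feed(capture["body"])
    return parser.links


def build_store(captures):
    """Storage: map (uri, timestamp) to the outlinks of that capture."""
    return {(c["uri"], c["timestamp"]): extract(c) for c in filter(keep, captures)}


def outlinks_at(store, uri, timestamp):
    """Access: outlinks of the latest capture of uri at or before timestamp."""
    candidates = [ts for (u, ts) in store if u == uri and ts <= timestamp]
    return store[(uri, max(candidates))] if candidates else set()
```

A dictionary keyed by `(uri, timestamp)` is only a stand-in for a real store; it keeps the sketch short while still showing that each capture of a page carries its own link set.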
Taxonomy
Topics: Web Data Mining and Analysis · Graph Theory and Algorithms · Data Management and Algorithms
