Temporally Extending Existing Web Archive Collections for Longitudinal Analysis
Lesley Frew, Michael L. Nelson, Michele C. Weigle

TL;DR
This paper presents a methodology to extend existing web archive collections temporally, enabling longitudinal analysis of website content changes over multiple administrations, demonstrated through environmental policy websites from 2008 to 2020.
Contribution
The paper introduces a novel approach for extending web archive collections backward in time, creating a dataset that supports comprehensive longitudinal analysis.
Findings
81% of pages changed between 2008 and 2020
87% of terms deleted under the Trump administration had been added under the Obama administration
Extended dataset enabled new longitudinal insights
Abstract
The Environmental Data and Governance Initiative (EDGI) regularly crawled US federal environmental websites between 2016 and 2020 to capture changes between two presidential administrations. However, because it does not include the previous administration ending in 2008, the collection is unsuitable for answering our research question: "Were the website terms deleted by the Trump administration (2017–2021) added by the Obama administration (2009–2017)?" Thus, like many researchers using the Wayback Machine's holdings for historical analysis, we do not have access to a complete collection suiting our needs. To answer our research question, we must extend the EDGI collection back to January 2008. This includes discovering relevant pages that persisted through 2020 but were not included in the EDGI collection, not just going further back in time with the existing pages. We pieced…
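The research question can be read as a set comparison across three snapshots of a page: one from the start of the window, one from the administration handover, and one from the end. The sketch below is only an illustration of that reading, not the authors' pipeline. The function names, the word-level tokenizer, and the sample texts are all hypothetical.

```python
import re


def tokenize(text):
    """Split text into a set of lowercase alphabetic terms (a simplifying assumption)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def share_of_deletions_previously_added(start_text, middle_text, end_text):
    """Fraction of terms deleted between middle and end that were added between start and middle.

    Returns None when nothing was deleted, since the share is undefined.
    """
    start, middle, end = tokenize(start_text), tokenize(middle_text), tokenize(end_text)
    added = middle - start      # terms introduced during the first period
    deleted = middle - end      # terms removed during the second period
    if not deleted:
        return None
    return len(deleted & added) / len(deleted)


# Hypothetical snapshots of one page at three points in time.
snapshot_2008 = "clean air water rules"
snapshot_2016 = "clean air water rules climate change"
snapshot_2020 = "clean air rules"

share = share_of_deletions_previously_added(snapshot_2008, snapshot_2016, snapshot_2020)
print(f"{share:.2f}")
```

In this toy case three terms are deleted ("water", "climate", "change") and two of them were added in the earlier period, so the share is two thirds. Answering the question for a real collection needs a snapshot near each boundary date for every page, which is why the collection has to reach back to 2008.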
Taxonomy
Topics: Data Analysis and Archiving · Recommender Systems and Techniques · Advanced Data Compression Techniques
