TL;DR
The paper introduces the Off-Topic Memento Toolkit, a set of methods to identify and filter off-topic web archive mementos, improving the quality of web archive research and analysis.
Contribution
It presents a comprehensive toolkit with multiple similarity measures for detecting off-topic mementos, validated against a manually curated gold standard dataset.
Findings
Multiple similarity measures achieve high F1 scores in off-topic detection.
The toolkit enables effective filtering of irrelevant web archive content.
A gold standard dataset supports benchmarking detection methods.
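As a minimal sketch of what similarity-based off-topic detection and its evaluation can look like: the code below compares each capture of a page to the first capture using Jaccard similarity over word sets, flags captures that fall below a threshold, and scores the flags against hand-labeled data with F1. The choice of Jaccard, the first-capture baseline, the 0.2 threshold, the function names, and the sample texts are all assumptions made for illustration; they are not taken from the paper or from the toolkit's actual interface.

```python
import re

def tokens(text):
    """Return the set of lowercase alphanumeric words in a page's text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B|, treated as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def flag_off_topic(mementos, threshold=0.2):
    """Flag each capture whose similarity to the first capture is below
    the threshold. The threshold value is a hypothetical placeholder."""
    baseline = tokens(mementos[0])
    return [jaccard_similarity(baseline, tokens(m)) < threshold
            for m in mementos]

def f1_score(predicted, gold):
    """F1 for the off-topic class, given parallel lists of booleans."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up captures of one seed; the last one has drifted to unrelated content.
mementos = [
    "Election results and candidate news for the city council",
    "City council election news: candidate results updated",
    "Buy cheap domain names and web hosting today",
]
gold = [False, False, True]  # hand-assigned labels for this toy example

predicted = flag_off_topic(mementos)
score = f1_score(predicted, gold)
```

In this toy case only the third capture is flagged, which matches the hand-assigned labels. On real collections the threshold would need tuning against labeled data, which is the role a gold standard dataset plays.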
Abstract
Web archive collections are created with a particular purpose in mind. A curator selects seeds, or original resources, which are then captured by an archiving system and stored as archived web pages, or mementos. The systems that build web archive collections are often configured to revisit the same original resource multiple times. This is incredibly useful for understanding an unfolding news story or the evolution of an organization. Unfortunately, over time, some of these original resources can go off-topic and no longer suit the purpose for which the collection was originally created. They can go off-topic due to web site redesigns, changes in domain ownership, financial issues, hacking, technical problems, or because their content has moved on from the original topic. Even though they are off-topic, the archiving system will still capture them, thus it becomes imperative to anyone…
