Scraping SERPs for Archival Seeds: It Matters When You Start
Alexander C. Nwala, Michele C. Weigle, Michael L. Nelson

TL;DR
This study investigates how quickly news stories disappear from Google search results and the challenges in retrieving the same URIs over time, emphasizing the importance of early and continuous collection for event-based archiving.
Contribution
It provides empirical data on how fast news stories are replaced on Google SERPs and on the probability of refinding the same URI over time, underscoring the need to start collecting early.
Findings
Stories are replaced rapidly on Google SERPs, with weekly replacement rates up to 0.79.
The probability of refinding the same URI drops sharply after one week, to between 0.01 and 0.11.
Early and persistent collection is crucial for capturing the full evolution of news events.
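To make the two measures above concrete, here is a minimal sketch in Python of how a replacement rate and a refind probability could be computed from daily SERP snapshots. The definitions used here (share of earlier URIs missing from a later snapshot; share of URIs still present a fixed number of days after first appearing) are assumptions chosen for illustration and may differ from the paper's exact formulas. The function names and the toy data are hypothetical.

```python
from datetime import date, timedelta

def replacement_rate(earlier, later):
    """Fraction of URIs in `earlier` that no longer appear in `later`."""
    earlier, later = set(earlier), set(later)
    if not earlier:
        return 0.0
    return len(earlier - later) / len(earlier)

def refind_probability(snapshots, lag_days):
    """Share of URIs still on the SERP `lag_days` after their first appearance.

    `snapshots` maps a date to the set of URIs collected on that date.
    URIs whose target date has no snapshot are skipped, not counted as misses.
    """
    # Record the first day each URI was seen.
    first_seen = {}
    for day in sorted(snapshots):
        for uri in snapshots[day]:
            first_seen.setdefault(uri, day)

    found = total = 0
    for uri, day in first_seen.items():
        target = day + timedelta(days=lag_days)
        if target not in snapshots:
            continue
        total += 1
        found += uri in snapshots[target]
    return found / total if total else 0.0

# Toy data (hypothetical): URIs collected for one query on three days.
snapshots = {
    date(2017, 5, 25): {"a", "b", "c", "d"},
    date(2017, 5, 26): {"a", "b", "e", "f"},
    date(2017, 6, 1): {"a", "g", "h", "i"},
}

daily = replacement_rate(snapshots[date(2017, 5, 25)], snapshots[date(2017, 5, 26)])
weekly = replacement_rate(snapshots[date(2017, 5, 25)], snapshots[date(2017, 6, 1)])
print(daily, weekly)
print(refind_probability(snapshots, 1), refind_probability(snapshots, 7))
```

In this toy data, three of the four Day 1 URIs are gone a week later, which illustrates why a collection started on Day 7 would miss most of what was visible on Day 1.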
Abstract
Event-based collections are often started with a web search, but the search results you find on Day 1 may not be the same as those you find on Day 7. In this paper, we consider collections that originate from extracting URIs (Uniform Resource Identifiers) from Search Engine Result Pages (SERPs). Specifically, we seek to provide insight about the retrievability of URIs of news stories found on Google, and to answer two main questions: first, can one "refind" the same URI of a news story (for the same query) from Google after a given time? Second, what is the probability of finding a story on Google over a given period of time? To answer these questions, we issued seven queries to Google every day for over seven months (2017-05-25 to 2018-01-12) and collected links from the first five SERPs to generate seven collections for each query. The queries represent public interest stories:…
Taxonomy
Topics: Web Data Mining and Analysis · Data Quality and Management · Semantic Web and Ontologies
