Scraping SERPs for Archival Seeds: It Matters When You Start

Alexander C. Nwala; Michele C. Weigle; Michael L. Nelson

arXiv:1805.10260·cs.DL·June 11, 2018

TL;DR

This study investigates how quickly news stories disappear from Google search results and the challenges in retrieving the same URIs over time, emphasizing the importance of early and continuous collection for event-based archiving.

Contribution

It provides empirical data on the rate of news story replacement on Google SERPs and the probability of rediscovering URIs over time, highlighting the urgency in collection efforts.

Findings

01. Stories are replaced rapidly on Google SERPs, with weekly replacement rates up to 0.79.

02. The probability of rediscovering the same URI drops sharply after one week, down to 0.01-0.11.

03. Early and persistent collection is crucial for capturing the full evolution of news events.

Abstract

Event-based collections are often started with a web search, but the search results you find on Day 1 may not be the same as those you find on Day 7. In this paper, we consider collections that originate from extracting URIs (Uniform Resource Identifiers) from Search Engine Result Pages (SERPs). Specifically, we seek to provide insight about the retrievability of URIs of news stories found on Google, and to answer two main questions: first, can one "refind" the same URI of a news story (for the same query) from Google after a given time? Second, what is the probability of finding a story on Google over a given period of time? To answer these questions, we issued seven queries to Google every day for over seven months (2017-05-25 to 2018-01-12) and collected links from the first five SERPs to generate seven collections for each query. The queries represent public interest stories:…
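
The two questions in the abstract can be illustrated with a small sketch. It assumes each day's SERP scrape has already been reduced to a set of URIs, and it uses simplified definitions that are not necessarily the paper's exact metrics: the replacement rate is the fraction of earlier URIs missing from a later scrape, and the refind probability is the fraction of (day, URI) observations still present a fixed number of days later. The function names, the `snapshots` structure, and the example.com URIs are hypothetical.

```python
from __future__ import annotations

from datetime import date, timedelta


def replacement_rate(earlier: set[str], later: set[str]) -> float:
    """Fraction of URIs in `earlier` that no longer appear in `later`."""
    if not earlier:
        return 0.0
    return len(earlier - later) / len(earlier)


def refind_probability(snapshots: dict[date, set[str]], lag_days: int) -> float:
    """Fraction of (day, URI) observations that are found again `lag_days` later.

    Days with no scrape available at day + lag_days are skipped.
    """
    found = 0
    total = 0
    for day, uris in snapshots.items():
        later = snapshots.get(day + timedelta(days=lag_days))
        if later is None:
            continue
        total += len(uris)
        found += len(uris & later)
    return found / total if total else 0.0


# Hypothetical scrapes for one query (illustrative data, not from the paper).
base = "https://example.com/"
snapshots = {
    date(2017, 5, 25): {base + "a", base + "b", base + "c", base + "d"},
    date(2017, 5, 26): {base + "a", base + "b", base + "e", base + "f"},
    date(2017, 6, 1): {base + "a", base + "g", base + "h", base + "i"},
}

day1 = snapshots[date(2017, 5, 25)]
day2 = snapshots[date(2017, 5, 26)]
print(replacement_rate(day1, day2))      # 0.5
print(refind_probability(snapshots, 1))  # 0.5
print(refind_probability(snapshots, 7))  # 0.25
```

In this toy data, half of the Day 1 URIs are still present after one day but only a quarter after seven days, which mirrors the direction of the decline the study reports without reproducing its numbers.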

Code & Models

Repositories

anwala/SERPRefind (Official)

Taxonomy

Topics: Web Data Mining and Analysis · Data Quality and Management · Semantic Web and Ontologies