Can Common Crawl reliably track persistent identifier (PID) use over time?

Henry S. Thompson; Jian Tong

arXiv:1802.01424 · cs.DL · February 6, 2018 · 1 cites

Open Access

TL;DR

This paper evaluates the reliability of using Common Crawl data to track persistent identifier usage over time, highlighting tooling challenges and proposing solutions for longitudinal web analysis.

Contribution

It provides an empirical assessment of Common Crawl's suitability for longitudinal studies of persistent identifiers and discusses necessary tooling improvements.

Findings

01

Identified issues with Common Crawl data reliability for longitudinal analysis

02

Analyzed over 10^12 URIs from the April 2014 and April 2017 crawls, then added the April 2015 and April 2016 crawls in a second study

03

Suggested specific actions to improve data usability for research

Abstract

We report here on the results of two studies using two and four monthly web crawls respectively from the Common Crawl (CC) initiative between 2014 and 2017, whose initial goal was to provide empirical evidence for the changing patterns of use of so-called persistent identifiers. This paper focusses on the tooling needed for dealing with CC data, and the problems we found with it. The first study is based on over $10^{12}$ URIs from over $5 \times 10^{9}$ pages crawled in April 2014 and April 2017, the second study adds a further $3 \times 10^{9}$ pages from the April 2015 and April 2016 crawls. We conclude with suggestions on specific actions needed to enable studies based on CC to give reliable longitudinal information.
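To make the kind of measurement described in the abstract concrete, here is a minimal Python sketch of tallying PID-style links per crawl. It is an illustration under stated assumptions, not the authors' tooling: the resolver hosts in `PID_HOSTS`, the function names `classify_pid` and `tally_pids`, and the crawl labels are all hypothetical, and the abstract does not say which identifier schemes the studies tracked or how URIs were extracted.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumed set of PID resolver hosts (hypothetical choice for illustration;
# the abstract does not list the schemes the studies actually covered).
PID_HOSTS = {
    "doi.org": "doi",
    "dx.doi.org": "doi",
    "hdl.handle.net": "handle",
    "purl.org": "purl",
}


def classify_pid(uri):
    """Return a PID scheme label for a URI, or None if no known resolver host matches."""
    try:
        host = urlsplit(uri).hostname  # already lower-cased by urllib
    except ValueError:
        # Malformed URIs (e.g. a broken IPv6 literal) are simply not counted.
        return None
    if host is None:
        return None
    return PID_HOSTS.get(host)


def tally_pids(uris_by_crawl):
    """Map each crawl label to a Counter of PID scheme -> number of matching URIs."""
    result = {}
    for crawl, uris in uris_by_crawl.items():
        counts = Counter()
        for uri in uris:
            label = classify_pid(uri)
            if label is not None:
                counts[label] += 1
        result[crawl] = counts
    return result


if __name__ == "__main__":
    # Tiny made-up sample; the labels stand in for the April 2014 and April 2017 crawls.
    sample = {
        "2014-04": ["http://dx.doi.org/10.1000/182", "http://example.com/"],
        "2017-04": ["https://doi.org/10.1000/182", "http://hdl.handle.net/1234/5"],
    }
    for crawl, counts in tally_pids(sample).items():
        print(crawl, dict(counts))
```

Raw counts like these are only comparable across crawls if the crawls themselves are comparable in size and coverage, which is the reliability question the paper raises.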

Taxonomy

Topics: Data Quality and Management · Web Data Mining and Analysis · Scientific Computing and Data Management