Can Common Crawl reliably track persistent identifier (PID) use over time?
Henry S. Thompson, Jian Tong

TL;DR
This paper evaluates the reliability of using Common Crawl data to track persistent identifier usage over time, highlighting tooling challenges and proposing solutions for longitudinal web analysis.
Contribution
It provides an empirical assessment of Common Crawl's suitability for longitudinal studies of persistent identifiers and discusses necessary tooling improvements.
Findings
Identified issues with Common Crawl data reliability for longitudinal analysis
Analyzed over 10^12 URIs across multiple years
Suggested specific actions to improve data usability for research
Abstract
We report here on the results of two studies using two and four monthly web crawls respectively from the Common Crawl (CC) initiative between 2014 and 2017, whose initial goal was to provide empirical evidence for the changing patterns of use of so-called persistent identifiers. This paper focusses on the tooling needed for dealing with CC data, and the problems we found with it. The first study is based on over URIs from over pages crawled in April 2014 and April 2017, the second study adds a further pages from the April 2015 and April 2016 crawls. We conclude with suggestions on specific actions needed to enable studies based on CC to give reliable longitudinal information.
Taxonomy
Topics: Data Quality and Management · Web Data Mining and Analysis · Scientific Computing and Data Management
