Should we trust web-scraped data?
Jens Foerderer

TL;DR
This paper highlights the risks of sampling bias in web-scraped data used in econometrics and machine learning, emphasizing the importance of recognizing, detecting, and mitigating bias sources for reliable analysis.
Contribution
It identifies three main sources of sampling bias in web scraping (volatility, personalization, and unindexed content) and offers practical recommendations for addressing them.
Findings
Sampling bias can significantly distort web-scraped data.
Web content volatility, personalization, and unindexed data are key bias sources.
Guidelines are provided for bias detection and mitigation.
Abstract
The increasing adoption of econometric and machine-learning approaches by empirical researchers has led to a widespread use of one data collection method: web scraping. Web scraping refers to the use of automated computer programs to access websites and download their content. The key argument of this paper is that naïve web scraping procedures can lead to sampling bias in the collected data. This article describes three sources of sampling bias in web-scraped data. More specifically, sampling bias emerges from web content being volatile (i.e., being subject to change), personalized (i.e., presented in response to request characteristics), and unindexed (i.e., absence of a population register). In a series of examples, I illustrate the prevalence and magnitude of sampling bias. To support researchers and reviewers, this paper provides recommendations on anticipating, detecting, and…
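To make the volatility point concrete, here is a minimal sketch of one possible check: scrape the same listing twice and measure how much the set of items changed between the two snapshots. This is an illustration under stated assumptions, not the procedure described in the paper. The function name `snapshot_turnover`, the item identifiers, and the two example snapshots are all hypothetical.

```python
def snapshot_turnover(first: set[str], second: set[str]) -> dict[str, float]:
    """Compare two scrapes of the same listing taken at different times.

    Returns the Jaccard overlap of the two item sets, the share of items
    from the first scrape that disappeared, and the share of items in the
    second scrape that are new. Low overlap suggests volatile content, so a
    single snapshot may not represent the population over time.
    """
    union = first | second
    if not union:
        # Two empty scrapes are trivially identical.
        return {"jaccard": 1.0, "dropped_share": 0.0, "added_share": 0.0}
    return {
        "jaccard": len(first & second) / len(union),
        "dropped_share": len(first - second) / len(first) if first else 0.0,
        "added_share": len(second - first) / len(second) if second else 0.0,
    }


# Hypothetical item IDs collected on two different days (made-up data).
monday = {"item-1", "item-2", "item-3", "item-4"}
friday = {"item-3", "item-4", "item-5"}

report = snapshot_turnover(monday, friday)
print(report)
```

In this made-up example, half of the items seen in the first scrape are gone by the second, which would be a signal to scrape repeatedly rather than rely on a single snapshot.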
Taxonomy
Topics: Consumer Market Behavior and Pricing · Data Quality and Management · Survey Methodology and Nonresponse
