Beyond time delays: How web scraping distorts measures of online news consumption
Roberto Ulloa, Frank Mangold, Felix Schmidt, Judith Gilsbach, Sebastian Stier

TL;DR
This paper examines how web scraping methods introduce biases in measuring online news consumption, highlighting significant discrepancies from in-situ data and proposing strategies to mitigate these errors for more accurate research outcomes.
Contribution
It identifies the primary sources of bias in web scraping of browsing data and offers recommendations to improve data accuracy in digital behavioral research.
Findings
Ex-situ scraping causes ~33.8% of measurement errors.
Time delays contribute an additional ~6.5% error at 90 days.
Content discrepancies vary across news categories and classification methods.
Abstract
As the exploration of digital behavioral data revolutionizes communication research, understanding the nuances of data collection methodologies becomes increasingly pertinent. This study focuses on one prominent data collection approach, web scraping, and more specifically, its application in the growing field of research relying on web browsing data. We investigate discrepancies between content obtained directly during user interaction with a website (in-situ) and content scraped using the URLs of participants' logged visits (ex-situ) with various time delays (0, 30, 60, and 90 days). We find substantial disparities between the methodologies, uncovering that errors are not uniformly distributed across news categories regardless of classification method (domain, URL, or content analysis). These biases compromise the precision of measurements used in existing literature. The ex-situ…
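The in-situ versus ex-situ comparison described in the abstract can be illustrated with a minimal sketch. It assumes each logged visit has already been classified twice (once from the in-situ content, once from the ex-situ scrape) and computes the share of visits whose two labels disagree at each scraping delay. The function name `mismatch_rate_by_delay`, the record layout, the label names, and the toy records are all hypothetical and made up for illustration; they are not the authors' pipeline or data.

```python
from collections import defaultdict


def mismatch_rate_by_delay(records):
    """Share of visits whose in-situ and ex-situ labels differ, per delay.

    `records` is an iterable of (delay_days, in_situ_label, ex_situ_label)
    tuples. This layout is an assumption made for the sketch.
    """
    totals = defaultdict(int)
    mismatches = defaultdict(int)
    for delay, in_situ_label, ex_situ_label in records:
        totals[delay] += 1
        if in_situ_label != ex_situ_label:
            mismatches[delay] += 1
    # Return delays in ascending order for readability.
    return {delay: mismatches[delay] / totals[delay] for delay in sorted(totals)}


# Toy records, invented for illustration only (not results from the paper).
toy_records = [
    (0, "news", "news"),
    (0, "news", "news"),
    (0, "not_news", "not_news"),
    (0, "news", "not_news"),
    (90, "news", "not_news"),
    (90, "news", "not_news"),
    (90, "not_news", "not_news"),
    (90, "news", "news"),
]

rates = mismatch_rate_by_delay(toy_records)
for delay, rate in rates.items():
    print(f"delay={delay:>2} days: mismatch rate = {rate:.2f}")
```

Separating the 0-day ex-situ scrape from the later scrapes is what would let one attribute disagreement to the collection environment rather than to elapsed time, which mirrors the distinction the abstract draws.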
Taxonomy
Topics: Media Influence and Politics
