PRITES: An integrative framework for investigating and assessing web-scraped HTTP-response datasets for research applications
Cynthia A. Huang, Tina Lam

TL;DR
This paper introduces PRITES, an integrated framework for systematically documenting, assessing, and improving the quality of web-scraped datasets used in research, combining technical and statistical perspectives.
Contribution
The paper presents a novel, comprehensive framework (PRITES) that guides the entire process of web-scraped data collection and evaluation, bridging multiple disciplines.
Findings
The framework supports better documentation and quality assessment of web-scraped data.
Applying PRITES to retail price data demonstrates its practical utility.
The framework enhances transparency and reproducibility in research that uses web-scraped data.
Abstract
The ability to programmatically retrieve vast quantities of data from online sources has given rise to increasing usage of web-scraped datasets for various purposes across government, industry and academia. Contemporaneously, there has also been growing discussion about the statistical qualities and limitations of collecting from online data sources and analysing web-scraped datasets. However, literature on web-scraping is distributed across computer science, statistical methodology and application domains, with distinct and occasionally conflicting definitions of web-scraping and conceptualisations of web-scraped data quality. This work synthesises technical and statistical concepts, best practices and insights across these relevant disciplines to inform documentation during web-scraping processes, and quality assessment of the resultant web-scraped datasets. We propose an integrated…
Taxonomy
Topics: Web Data Mining and Analysis · Consumer Market Behavior and Pricing · Data Stream Mining Techniques
