Incorrect Data in the Widely Used Inside Airbnb Dataset
Abdulkareem Alsudais

TL;DR
This paper identifies and documents data quality issues in the widely used Inside Airbnb dataset, highlighting systemic errors caused by new features and discussing implications for research reproducibility.
Contribution
It provides the first thorough investigation of data validity issues in the Inside Airbnb dataset and links errors to specific systemic causes.
Findings
Incorrect data due to systemic errors identified
Data issues linked to new Airbnb features
Reproducibility problems between dataset releases
Abstract
Several recently published papers in Decision Support Systems discussed issues related to data quality in Information Systems research. In this short research note, I build on the work introduced in these papers and document two data quality issues discovered in a large open dataset commonly used in research. Inside Airbnb (IA) collects data from places and reviews as posted by users of Airbnb.com. Visitors can effortlessly download data collected by IA for several locations around the globe. While the dataset is widely used in academic research, no thorough investigation of the dataset and its validity has been conducted. This note examines the dataset and explains an issue of incorrect data added to the dataset. Findings suggest that this issue can be attributed to systemic errors in the data collection process. The results suggest that the use of unverified open datasets can be…
