Data Quality in Empirical Software Engineering: A Targeted Review
Michael Franklin Bosu, Stephen G. MacDonell

TL;DR
This paper reviews how empirical software engineering studies report on data quality, highlighting inconsistencies and advocating for better data collection, pre-processing, and quality assessment to improve model reliability.
Contribution
It provides a targeted review of data quality reporting in ESE, revealing gaps and emphasizing the need for standardized practices to enhance research robustness.
Findings
Only 23 of 221 studies reported all three data quality elements.
Data collection procedures are inconsistently documented.
Better data quality reporting can lead to more reliable models.
Abstract
Context: The utility of prediction models in empirical software engineering (ESE) is heavily reliant on the quality of the data used in building those models. Several data quality challenges such as noise, incompleteness, outliers and duplicate data points may be relevant in this regard. Objective: We investigate the reporting of three potentially influential elements of data quality in ESE studies: data collection, data pre-processing, and the identification of data quality issues. This enables us to establish how researchers view the topic of data quality and the mechanisms that are being used to address it. Greater awareness of data quality should inform both the sound conduct of ESE research and the robust practice of ESE data collection and processing. Method: We performed a targeted literature review of empirical software engineering studies covering the period January 2007 to…
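As an illustration of the data quality challenges named in the abstract (incompleteness, duplicate data points, and outliers), the following is a minimal Python sketch of how such issues might be counted in a small dataset. It is not taken from the paper: the function name `quality_report`, the record layout (a list of dicts), and the 1.5 × IQR outlier rule are all assumptions chosen for illustration.

```python
from statistics import quantiles


def quality_report(rows, numeric_field):
    """Count incomplete records, exact duplicates, and IQR outliers.

    Hypothetical helper: `rows` is assumed to be a list of dicts with
    identical keys, and `numeric_field` names a numeric column.
    """
    # Incompleteness: records in which any field is None.
    missing = sum(1 for r in rows if any(v is None for v in r.values()))

    # Duplicates: records whose full content repeats an earlier record.
    seen, duplicates = set(), 0
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)

    # Outliers: values beyond 1.5 * IQR from the quartiles (an assumed rule).
    values = [r[numeric_field] for r in rows if r[numeric_field] is not None]
    q1, _, q3 = quantiles(values, n=4)
    spread = 1.5 * (q3 - q1)
    outliers = [v for v in values if v < q1 - spread or v > q3 + spread]

    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}
```

Reporting counts like these alongside the collection and pre-processing steps is one way a study could document all three elements the review looks for; the thresholds would need to be justified for each dataset.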
