Data Quality in Empirical Software Engineering: A Targeted Review

Michael Franklin Bosu; Stephen G. MacDonell

arXiv:2105.10895 · cs.SE · May 25, 2021

TL;DR

This paper reviews how empirical software engineering studies report on data quality, highlighting inconsistencies and advocating for better data collection, pre-processing, and quality assessment to improve model reliability.

Contribution

It provides a targeted review of data quality reporting in empirical software engineering (ESE), revealing gaps in that reporting and emphasizing the need for standardized practices to make research more robust.

Findings

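01. Only 23 of 221 studies (about 10%) reported all three data quality elements: data collection, data pre-processing, and the identification of data quality issues.

02. Data collection procedures are inconsistently documented.

03. Better data quality reporting can lead to more reliable models.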

Abstract

Context: The utility of prediction models in empirical software engineering (ESE) is heavily reliant on the quality of the data used in building those models. Several data quality challenges such as noise, incompleteness, outliers and duplicate data points may be relevant in this regard.

Objective: We investigate the reporting of three potentially influential elements of data quality in ESE studies: data collection, data pre-processing, and the identification of data quality issues. This enables us to establish how researchers view the topic of data quality and the mechanisms that are being used to address it. Greater awareness of data quality should inform both the sound conduct of ESE research and the robust practice of ESE data collection and processing.

Method: We performed a targeted literature review of empirical software engineering studies covering the period January 2007 to…
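As an illustration of the data quality challenges named in the abstract (incompleteness, duplicate data points, outliers), the following is a minimal sketch of how such issues might be flagged in a small dataset. It is not the authors' method. The function name `quality_report`, the field names, the sample records, and the 1.5 × IQR outlier rule are all assumptions chosen for illustration.

```python
from statistics import quantiles


def quality_report(rows, numeric_field):
    """Summarize three simple data quality signals for a list of dict records.

    Hypothetical helper for illustration only.
    """
    # Incompleteness: records with at least one missing (None) value.
    incomplete = sum(1 for r in rows if any(v is None for v in r.values()))

    # Duplicates: records whose full set of field values was already seen.
    seen, duplicates = set(), 0
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)

    # Outliers: values outside 1.5 * IQR of the chosen numeric field
    # (an assumed rule of thumb, one of many possible definitions).
    values = [r[numeric_field] for r in rows if r[numeric_field] is not None]
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]

    return {"incomplete": incomplete, "duplicates": duplicates, "outliers": outliers}


# Made-up effort records, not data from the reviewed studies.
rows = [
    {"project": "A", "size_kloc": 10, "effort_hours": 120},
    {"project": "B", "size_kloc": 12, "effort_hours": 150},
    {"project": "B", "size_kloc": 12, "effort_hours": 150},    # exact duplicate
    {"project": "C", "size_kloc": None, "effort_hours": 130},  # missing value
    {"project": "D", "size_kloc": 11, "effort_hours": 140},
    {"project": "E", "size_kloc": 9, "effort_hours": 5000},    # suspicious effort
]

report = quality_report(rows, "effort_hours")
print(report)
```

Reporting counts like these alongside a description of how the data were collected and pre-processed is one way a study could cover all three elements the review looks for.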
