A Taxonomy of Data Quality Challenges in Empirical Software Engineering

Michael Franklin Bosu; Stephen G. MacDonell

arXiv:2106.06141 · cs.SE · June 14, 2021

TL;DR

This paper presents a comprehensive taxonomy of data quality challenges in empirical software engineering, highlighting issues affecting data fitness, transferability, and accessibility, and reviews current assessment and mitigation techniques.

Contribution

The paper introduces a taxonomy that categorizes data quality issues in empirical software engineering and, for each category, reviews the existing assessment techniques and the mechanisms proposed to address the issue, where available.

Findings

01. Data quality issues impact model accuracy and reliability.

02. Current assessment techniques address some challenges, but gaps remain.

03. Accessibility and trust in data require further research.

Abstract

Reliable empirical models such as those used in software effort estimation or defect prediction are inherently dependent on the data from which they are built. As demands for process and product improvement continue to grow, the quality of the data used in measurement and prediction systems warrants increasingly close scrutiny. In this paper we propose a taxonomy of data quality challenges in empirical software engineering, based on an extensive review of prior research. We consider current assessment techniques for each quality issue and proposed mechanisms to address these issues, where available. Our taxonomy classifies data quality issues into three broad areas: first, characteristics of data that mean they are not fit for modeling; second, data set characteristics that lead to concerns about the suitability of applying a given model to another data set; and third, factors that…
