TL;DR
This paper evaluates the quality of 13 datasets used in software effort estimation research, highlighting data quality issues and proposing a benchmarking template to improve dataset utility and data collection practices.
Contribution
It provides a systematic assessment of the quality of datasets used in empirical software engineering (ESE) research on effort estimation, and introduces a benchmarking template to improve data collection and evaluation.
Findings
Identified prevalent data quality issues in 13 datasets used extensively in software effort estimation research.
Assessed the fitness for purpose of these datasets.
Proposed a benchmarking template for evaluating datasets and guiding data collection.
Abstract
Data is a cornerstone of empirical software engineering (ESE) research and practice. Data underpin numerous process and project management activities, including the estimation of development effort and the prediction of the likely location and severity of defects in code. Serious questions have been raised, however, over the quality of the data used in ESE. Data quality problems caused by noise, outliers, and incompleteness have been noted as being especially prevalent. Other quality issues, although also potentially important, have received less attention. In this study, we assess the quality of 13 datasets that have been used extensively in research on software effort estimation. The quality issues considered in this article draw on a taxonomy that we published previously based on a systematic mapping of data quality issues in ESE. Our contributions are as follows: (1) an evaluation…
