An Audit of Machine Learning Experiments on Software Defect Prediction

Giuseppe Destefanis; Leila Yousefi; Martin Shepperd; Allan Tucker; Stephen Swift; Steve Counsell; Mahir Arzoky

arXiv:2601.18477 · cs.SE · January 27, 2026

Open Access

TL;DR

This paper audits recent machine learning studies on software defect prediction. It reveals wide variability in experimental practices and limited reproducibility, and highlights the need for improved reporting standards.

Contribution

It provides a comprehensive analysis of experimental design and reproducibility issues in recent SDP studies, offering insights for better research practices.

Findings

01. Approximately 1,585 experiments identified in recent literature.

02. Only about 45% of studies used formal statistical inference.

03. Reproducibility varied from near complete to severely limited.

Abstract

Background: Machine learning algorithms are widely used to predict defect-prone software components. In this literature, computational experiments are the main means of evaluation, and the credibility of results depends on experimental design and reporting. Objective: This paper audits recent software defect prediction (SDP) studies by assessing their experimental design, analysis, and reporting practices against accepted norms from statistics, machine learning, and empirical software engineering. The aim is to characterise current practice and assess the reproducibility of published results. Method: We audited SDP studies indexed in SCOPUS between 2019 and 2023, focusing on design and analysis choices such as outcome measures, out-of-sample validation strategies, and the use of statistical inference. Nine study issues were evaluated. Reproducibility was assessed using the instrument…

Taxonomy

Topics: Software Engineering Research · Imbalanced Data Classification Techniques · Software Engineering Techniques and Practices