The impact of using biased performance metrics on software defect prediction research
Jingxiu Yao, Martin Shepperd

TL;DR
This study shows that the widespread use of the biased F1 metric in software defect prediction research can lead to significant errors: about 22% of pairwise results change direction when an unbiased metric such as MCC is used instead.
Contribution
The paper systematically compares F1 and MCC metrics in defect prediction studies, highlighting the impact of metric choice on research validity and urging adoption of unbiased metrics.
Findings
Of 12,471 pairwise results drawn from 38 primary studies, 21.95% changed direction when MCC was used instead of F1.
F1 remains widely used despite its known biases.
Using unbiased metrics can improve the validity of defect prediction research.
Abstract
Context: Software engineering researchers have undertaken many experiments investigating the potential of software defect prediction algorithms. Unfortunately, some widely used performance metrics are known to be problematic, most notably F1, but nevertheless F1 is widely used. Objective: To investigate the potential impact of using F1 on the validity of this large body of research. Method: We undertook a systematic review to locate relevant experiments and then extract all pairwise comparisons of defect prediction performance using F1 and the un-biased Matthews correlation coefficient (MCC). Results: We found a total of 38 primary studies. These contain 12,471 pairs of results. Of these, 21.95% changed direction when the MCC metric is used instead of the biased F1 metric. Unfortunately, we also found evidence suggesting that F1 remains widely used in software defect prediction…
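A "change of direction" means that F1 ranks one classifier above another while MCC ranks them the other way round. The sketch below illustrates this with two hypothetical confusion matrices on an imbalanced dataset (80 defective and 20 non-defective modules). The numbers, the `Confusion` type, and the `direction_changed` helper are assumptions made for illustration; they are not taken from the paper and do not reproduce the authors' extraction procedure.

```python
from math import sqrt
from typing import NamedTuple


class Confusion(NamedTuple):
    """Binary confusion matrix; the positive class is 'defective'."""
    tp: int
    fp: int
    fn: int
    tn: int


def f1(c: Confusion) -> float:
    # F1 ignores true negatives entirely, which is one source of its bias.
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


def mcc(c: Confusion) -> float:
    # MCC uses all four cells; by convention it is set to 0 when undefined.
    denom = sqrt((c.tp + c.fp) * (c.tp + c.fn) * (c.tn + c.fp) * (c.tn + c.fn))
    return (c.tp * c.tn - c.fp * c.fn) / denom if denom else 0.0


def direction_changed(a: Confusion, b: Confusion) -> bool:
    # True when F1 and MCC disagree about which classifier is better.
    return (f1(a) - f1(b)) * (mcc(a) - mcc(b)) < 0


# Hypothetical classifier A: flags almost every module as defective.
clf_a = Confusion(tp=78, fp=18, fn=2, tn=2)
# Hypothetical classifier B: misses more defects but raises few false alarms.
clf_b = Confusion(tp=60, fp=2, fn=20, tn=18)

for name, c in (("A", clf_a), ("B", clf_b)):
    print(f"{name}: F1={f1(c):.3f}  MCC={mcc(c):.3f}")
print("direction changed:", direction_changed(clf_a, clf_b))
```

In this made-up case F1 prefers classifier A, which labels nearly everything defective, while MCC prefers classifier B, because MCC also accounts for how well the non-defective modules are classified.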
Taxonomy
Topics: Software Engineering Research · Software Reliability and Analysis Research · Software System Performance and Reliability
