The Impact of Class Rebalancing Techniques on the Performance and Interpretation of Defect Prediction Models
Chakkrit Tantithamthavorn, Ahmed E. Hassan, Kenichi Matsumoto

TL;DR
This study evaluates how class rebalancing techniques affect defect prediction models' performance and interpretability across diverse datasets, providing practical guidelines for their use based on empirical evidence.
Contribution
It systematically assesses four popular rebalancing techniques on multiple performance metrics and offers insights into when these techniques are beneficial or detrimental.
Findings
Rebalancing improves recall but hampers model interpretability.
Rebalancing has no significant impact on AUC.
Use AUC as a standard measure for model comparison.
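To make the contrast between the recall and AUC findings concrete, here is a minimal sketch of how a threshold-dependent measure and a ranking-based measure can react differently to the same change in predicted scores. The toy scores, the 0.5 threshold, and the uniform score shift are illustrative assumptions, not data or procedures from the paper; the shift only stands in for a model that scores every module as more likely to be defective.

```python
def recall(scores, labels, threshold=0.5):
    """Fraction of defective modules (label 1) scored at or above the threshold."""
    true_positives = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    return true_positives / sum(labels)

def auc(scores, labels):
    """AUC as the probability that a defective module outranks a clean one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: four clean modules (0) and two defective modules (1).
toy_labels = [0, 0, 0, 0, 1, 1]
toy_scores = [0.05, 0.10, 0.20, 0.35, 0.30, 0.45]

# Assumption: shift every score up by the same amount, preserving the ranking.
shifted_scores = [min(1.0, s + 0.25) for s in toy_scores]

recall_before = recall(toy_scores, toy_labels)
recall_after = recall(shifted_scores, toy_labels)
auc_before = auc(toy_scores, toy_labels)
auc_after = auc(shifted_scores, toy_labels)
```

In this sketch recall changes because more defective modules cross the fixed threshold, while AUC stays the same because the ordering of modules is unchanged. Real rebalancing can also alter the ranking, so this shows only one possible mechanism.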
Abstract
Defect prediction models trained on class-imbalanced datasets (i.e., datasets in which defective and clean modules are not equally represented) are highly susceptible to producing inaccurate predictions. Prior research compares the impact of class rebalancing techniques on the performance of defect prediction models, but arrives at contradictory conclusions because the studies use different datasets, classification techniques, and performance measures. Such contradictory conclusions make it hard to derive practical guidelines for whether class rebalancing techniques should be applied in the context of defect prediction models. In this paper, we investigate the impact of four widely used class rebalancing techniques on ten commonly used performance measures and on the interpretation of defect prediction models. We also construct statistical models to better…
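As an illustration of what class rebalancing does to a training set, the following is a minimal sketch of random over-sampling and random under-sampling using only the Python standard library. These two techniques are chosen for simplicity and are an assumption; the text above does not name the four techniques studied. The function names and toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels

# Hypothetical toy training set: 8 clean modules (0) and 2 defective modules (1).
train_rows = [[i] for i in range(10)]
train_labels = [0] * 8 + [1] * 2

over_rows, over_labels = random_oversample(train_rows, train_labels)
under_rows, under_labels = random_undersample(train_rows, train_labels)
print(Counter(over_labels))   # class counts after over-sampling
print(Counter(under_labels))  # class counts after under-sampling
```

Rebalancing of this kind would normally be applied to the training data only, leaving the test data at its original class distribution so that performance measures reflect the real imbalance.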
Taxonomy
Topics: Software Engineering Research · Software Reliability and Analysis Research · Imbalanced Data Classification Techniques
