An Empirical Study on the Effectiveness of Data Resampling Approaches for Cross-Project Software Defect Prediction
Kwabena Ebo Bennin, Amjed Tahir, Stephen G. MacDonell, Jürgen Börstler

TL;DR
This study evaluates how different data resampling techniques affect cross-project defect prediction (CPDP) models. It shows that resampling improves recall and g-measure but may reduce precision, a trade-off that can guide model choice.
Contribution
It provides an empirical assessment of various oversampling and undersampling methods on CPDP models, highlighting their impact on prediction performance.
Findings
Data resampling improves recall and g-measure in CPDP.
Resampling can decrease precision and increase false alarms.
Different resampling methods have varying effects on model performance.
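The trade-off described in these findings can be made concrete with a small sketch. It assumes the g-measure definition commonly used in defect prediction work (the harmonic mean of recall and 1 minus the false alarm rate); this summary does not spell out the formula. The function name `defect_metrics` and both confusion matrices are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def defect_metrics(tp, fp, tn, fn):
    """Compute recall, precision, false alarm rate (pf), and g-measure.

    Assumption: g-measure is the harmonic mean of recall and (1 - pf).
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    specificity = 1.0 - pf
    denom = recall + specificity
    g_measure = 2 * recall * specificity / denom if denom else 0.0
    return {"recall": recall, "precision": precision,
            "pf": pf, "g_measure": g_measure}


# Hypothetical confusion matrices for the same 200-module test set
# (40 buggy, 160 non-buggy), before and after resampling the training data.
before = defect_metrics(tp=10, fp=5, tn=155, fn=30)
after = defect_metrics(tp=30, fp=50, tn=110, fn=10)

for name, m in (("before", before), ("after", after)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

In this made-up example the model trained on resampled data finds more of the buggy modules (higher recall and g-measure) but also flags more clean modules, so precision falls and the false alarm rate rises.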
Abstract
Cross-project defect prediction (CPDP), where data from different software projects are used to predict defects, has been proposed as a way to provide data for software projects that lack historical data. Evaluations of CPDP models using the Nearest Neighbour (NN) Filter approach have shown promising results in recent studies. A key challenge with defect-prediction datasets is class imbalance, that is, highly skewed datasets in which non-buggy modules far outnumber buggy modules. In the past, data resampling approaches have been applied to within-project defect prediction models to help alleviate the negative effects of class imbalance in the datasets. To address the class imbalance issue in CPDP, the authors assess the impact of data resampling approaches on CPDP models after the NN Filter is applied. The impact on prediction performance of five oversampling approaches (MAHAKIL, SMOTE,…
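The pipeline the abstract outlines (filter the cross-project data with the NN Filter, then resample it) might look roughly like the sketch below. Several things here are assumptions rather than details from the paper: the filter keeps the union of the k nearest source rows (Euclidean distance) for each target row, k defaults to 10, and plain random oversampling stands in for the techniques the study actually evaluates, such as MAHAKIL or SMOTE. The function names and toy data are hypothetical.

```python
import math
import random


def nn_filter(source_X, source_y, target_X, k=10):
    """Keep the union of the k nearest source rows for each target row."""
    selected = set()
    for t in target_X:
        ranked = sorted(range(len(source_X)),
                        key=lambda i: math.dist(source_X[i], t))
        selected.update(ranked[:k])
    idx = sorted(selected)
    return [source_X[i] for i in idx], [source_y[i] for i in idx]


def random_oversample(X, y, seed=0):
    """Duplicate random minority rows until both classes are the same size.

    Stand-in for the oversampling methods named in the abstract.
    """
    rng = random.Random(seed)
    buggy = [i for i, label in enumerate(y) if label == 1]
    clean = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (buggy, clean) if len(buggy) < len(clean) else (clean, buggy)
    if not minority:
        return list(X), list(y)  # nothing to duplicate
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]


# Toy data: two metrics per module, label 1 = buggy (hypothetical values).
source_X = [[0, 0], [1, 1], [10, 10], [11, 11], [0.5, 0.5], [50, 50]]
source_y = [0, 0, 1, 0, 0, 1]
target_X = [[0, 0], [10, 10]]

filtered_X, filtered_y = nn_filter(source_X, source_y, target_X, k=2)
balanced_X, balanced_y = random_oversample(filtered_X, filtered_y)
print(filtered_y, balanced_y)
```

The resampling step runs only on the filtered training data; the target project's data is left untouched so that evaluation reflects its real class distribution.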
