Is "Better Data" Better than "Better Data Miners"? (On the Benefits of Tuning SMOTE for Defect Prediction)
Amritanshu Agrawal, Tim Menzies

TL;DR
This paper shows that SMOTUNED, a self-tuning version of SMOTE, substantially improves defect prediction as measured by AUC and recall, and argues that data pre-processing can matter more than the choice of classifier.
Contribution
It introduces SMOTUNED, a self-tuning version of SMOTE, and shows its effectiveness in enhancing defect prediction across multiple criteria and datasets.
Findings
SMOTUNED increases AUC by 60% and recall by 20% in a 5*5 cross-validation study of 3,681 Java classes.
Data pre-processing can outweigh classifier selection in defect prediction.
SMOTUNED outperforms recent class imbalance techniques.
Abstract
We report and fix an important systematic error in prior studies that ranked classifiers for software analytics. Those studies did not (a) assess classifiers on multiple criteria and they did not (b) study how variations in the data affect the results. Hence, this paper applies (a) multi-criteria tests while (b) fixing the weaker regions of the training data (using SMOTUNED, which is a self-tuning version of SMOTE). This approach leads to dramatically large increases in software defect predictions. When applied in a 5*5 cross-validation study for 3,681 Java classes (containing over a million lines of code) from open source systems, SMOTUNED increased AUC and recall by 60% and 20% respectively. These improvements are independent of the classifier used to predict for quality. The same kind of pattern (improvement) was observed when a comparative analysis of SMOTE and SMOTUNED was done against…
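To make "a self-tuning version of SMOTE" concrete, here is a minimal sketch, not the authors' implementation. SMOTE creates synthetic minority-class rows by interpolating between a minority row and one of its nearest minority neighbours. The excerpt above does not say which parameters SMOTUNED tunes or which optimizer it uses, so everything tunable here is an assumption made for illustration: the parameter names `k`, `n_new`, and `r`, their ranges, the plain random search, and the caller-supplied `score` function.

```python
import random


def minkowski(a, b, r):
    """Minkowski distance with exponent r (r=2 is Euclidean)."""
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1.0 / r)


def smote(minority, n_new, k=5, r=2, rng=None):
    """Return n_new synthetic rows built from the minority-class rows.

    Each synthetic row lies on the line segment between a randomly chosen
    minority row and one of its k nearest minority neighbours.
    """
    rng = rng or random.Random(0)
    if len(minority) < 2:
        raise ValueError("need at least two minority rows")
    k = min(k, len(minority) - 1)  # cannot have more neighbours than rows
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [row for j, row in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda row: minkowski(base, row, r))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


def tune_smote(minority, score, budget=20, seed=1):
    """Random search over assumed SMOTE settings (hypothetical ranges).

    `score(synthetic_rows, params)` is supplied by the caller and returns a
    number where higher is better. Returns (best_score, best_params).
    """
    rng = random.Random(seed)
    best = None
    for _ in range(budget):
        params = {
            "k": rng.randint(1, 20),
            "n_new": rng.choice([50, 100, 200, 400]),
            "r": rng.uniform(1.0, 5.0),
        }
        synthetic = smote(minority, rng=random.Random(seed), **params)
        s = score(synthetic, params)
        if best is None or s > best[0]:
            best = (s, params)
    return best
```

In practice `score` would train a classifier on the training data plus the synthetic rows and return a measure such as AUC or recall on a held-out tuning split, keeping the final test data untouched during the search.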
Taxonomy
Topics: Software Engineering Research · Software Reliability and Analysis Research · Imbalanced Data Classification Techniques
