CopulaSMOTE: A Copula-Based Oversampling Approach for Imbalanced Classification in Diabetes Prediction
Agnideep Aich, Md Monzur Murshed, Sameera Hewage, Amanda Mayeaux

TL;DR
This paper introduces CopulaSMOTE, a novel copula-based data augmentation method that preserves dependency structures for imbalanced classification in diabetes prediction, outperforming traditional SMOTE.
Contribution
The paper is the first to use A2 copulas for data augmentation, providing an effective alternative to SMOTE for imbalanced healthcare datasets.
Findings
Random Forest with A2 copula oversampling achieved the highest performance.
CopulaSMOTE improved accuracy, precision, recall, F1-score, and AUC over SMOTE.
Statistical validation confirmed the significance of results.
Abstract
Diabetes mellitus poses a significant health risk, affecting nearly 1 in 9 people. Early detection can significantly lower this risk. Despite significant advances in machine learning for identifying diabetic cases, results can still be skewed by class imbalance in the data. To address this challenge, our study uses copula-based data augmentation, which preserves the dependency structure when generating data for the minority class, and integrates it with machine learning (ML) techniques. We selected the Pima Indian dataset, generated data using the A2 copula, and then applied five machine learning algorithms: logistic regression, random forest, gradient boosting, extreme gradient boosting, and multilayer perceptron. Overall, our findings show that random forest with A2 copula oversampling (theta = 10) achieved the best performance, with improvements of 5.3% in…
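To make the idea of dependency-preserving oversampling concrete, the following is a minimal sketch for two features of a minority class. It is not the authors' method: the A2 copula's formula is not given in this summary, so a Gaussian copula is swapped in as a stand-in. All function names (`pseudo_observations`, `empirical_quantile`, `gaussian_copula_oversample`) and the toy data are hypothetical and chosen only for illustration.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the copula transform


def pseudo_observations(values):
    """Map values into (0, 1) through their ranks: u_i = rank_i / (n + 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def empirical_quantile(sorted_values, u):
    """Linearly interpolated empirical quantile at level u in [0, 1]."""
    pos = u * (len(sorted_values) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def gaussian_copula_oversample(xs, ys, n_new, seed=0):
    """Draw n_new synthetic (x, y) pairs for the minority class.

    Assumption: a Gaussian copula stands in for the paper's A2 copula.
    """
    rng = random.Random(seed)
    # 1. Ranks -> uniforms -> normal scores, separately for each feature.
    zx = [_STD_NORMAL.inv_cdf(u) for u in pseudo_observations(xs)]
    zy = [_STD_NORMAL.inv_cdf(u) for u in pseudo_observations(ys)]
    # 2. The dependence parameter is the correlation of the normal scores.
    rho = max(-1.0, min(1.0, pearson(zx, zy)))
    sx, sy = sorted(xs), sorted(ys)
    synthetic = []
    for _ in range(n_new):
        # 3. Sample a correlated pair of standard normals.
        g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z1 = g1
        z2 = rho * g1 + math.sqrt(1.0 - rho * rho) * g2
        # 4. Back to uniforms, then to the data scale via empirical quantiles.
        u1, u2 = _STD_NORMAL.cdf(z1), _STD_NORMAL.cdf(z2)
        synthetic.append((empirical_quantile(sx, u1), empirical_quantile(sy, u2)))
    return synthetic
```

In a pipeline like the one the abstract describes, the synthetic pairs would be appended to the minority class of the training split before a classifier is fitted. Unlike SMOTE, which interpolates between a sample and one of its nearest neighbours, this kind of sampler draws new points from a fitted joint distribution, which is how the dependence between features is carried over.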
Taxonomy
Methods: Synthetic Minority Over-sampling Technique (SMOTE)
