CopulaSMOTE: A Copula-Based Oversampling Approach for Imbalanced Classification in Diabetes Prediction

Agnideep Aich; Md Monzur Murshed; Sameera Hewage; Amanda Mayeaux

arXiv:2506.17326·cs.LG·September 26, 2025

TL;DR

This paper introduces CopulaSMOTE, a novel copula-based data augmentation method that preserves dependency structures for imbalanced classification in diabetes prediction, outperforming traditional SMOTE.

Contribution

This work is the first to use A2 copulas for data augmentation, providing an effective alternative to SMOTE for imbalanced healthcare datasets.

Findings

1. Random Forest with A2 copula oversampling achieved the highest performance.
2. CopulaSMOTE improved accuracy, precision, recall, F1-score, and AUC over SMOTE.
3. Statistical validation confirmed the significance of results.
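As a reminder of what the five reported metrics measure, here is a minimal sketch in plain Python. The function names `binary_metrics` and `roc_auc` are hypothetical helpers written for this illustration and are not taken from the paper, which does not state how its metrics were computed.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = diabetic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

On imbalanced data, accuracy alone can look good while recall on the minority class stays poor, which is why the comparison against SMOTE is reported on all five metrics rather than one.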

Abstract

Diabetes mellitus poses a significant health risk, affecting nearly 1 in 9 people. Early detection can substantially lower this risk. Despite major advances in machine learning for identifying diabetic cases, results can still be skewed by class imbalance in the data. To address this challenge, our study uses copula-based data augmentation, which preserves the dependency structure when generating data for the minority class, and integrates it with machine learning (ML) techniques. We selected the Pima Indian dataset, generated data using the A2 copula, and then applied five ML algorithms: logistic regression, random forest, gradient boosting, extreme gradient boosting, and multilayer perceptron. Overall, our findings show that Random Forest with A2 copula oversampling (theta = 10) achieved the best performance, with improvements of 5.3% in…
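To make the idea of copula-based oversampling concrete, the sketch below generates synthetic minority-class rows while preserving the rank dependence between features. It is an illustration under stated assumptions, not the authors' method: it substitutes a Gaussian copula for the A2 copula (whose generator is not given on this page), uses empirical marginals, and the name `gaussian_copula_oversample` is hypothetical. It assumes at least two numeric features.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for cdf / inverse cdf


def gaussian_copula_oversample(X_min, n_new, seed=0):
    """Draw n_new synthetic rows that keep the rank dependence of X_min.

    NOTE: a Gaussian copula stands in here for the paper's A2 copula.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n, d = X_min.shape

    # 1. Pseudo-observations: per-column ranks scaled into (0, 1).
    ranks = X_min.argsort(axis=0).argsort(axis=0) + 1
    U = ranks / (n + 1)

    # 2. Map to normal scores and estimate their correlation matrix.
    Z = np.vectorize(_STD_NORMAL.inv_cdf)(U)
    R = np.corrcoef(Z, rowvar=False)

    # 3. Sample correlated normals and map them back to uniforms.
    Z_new = rng.multivariate_normal(np.zeros(d), R, size=n_new)
    U_new = np.vectorize(_STD_NORMAL.cdf)(Z_new)

    # 4. Invert each empirical marginal to return to feature space.
    return np.column_stack(
        [np.quantile(X_min[:, j], U_new[:, j]) for j in range(d)]
    )
```

In a pipeline like the one the abstract describes, the function would be called on the minority-class rows of the training split only, and the returned rows appended with the minority label before fitting a classifier. Unlike SMOTE, which interpolates between neighbouring points, this approach samples from a fitted joint dependence model, so synthetic values are not confined to line segments between existing samples.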

Taxonomy

Methods: Synthetic Minority Over-sampling Technique