A Copula Based Supervised Filter for Feature Selection in Diabetes Risk Prediction Using Machine Learning
Agnideep Aich, Md Monzur Murshed, Sameera Hewage, Amanda Mayeaux

TL;DR
This paper introduces a novel copula-based supervised filter for feature selection that emphasizes extreme value associations, improving interpretability and efficiency in diabetes risk prediction models.
Contribution
It proposes a new tail-focused feature ranking method based on a Gumbel-copula implied concordance score, which matches or outperforms standard methods on both a large-scale survey dataset and a clinical dataset.
Findings
Reduces the feature set by approximately 52% on the CDC dataset
Achieves the highest ROC-AUC on the PIMA dataset
Provides a fast, interpretable feature screening method
Abstract
Effective feature selection is critical for robust and interpretable predictive modeling in medicine, especially when risk factors matter most in extreme patient strata. Many standard selectors emphasize average associations and can miss predictors whose relevance is concentrated in the distribution tails. We propose a computationally efficient supervised filter based on a Gumbel-copula implied upper-tail concordance score (lambda U), defined as a monotone transformation of Kendall's tau, to rank features by their tendency to be simultaneously extreme with the positive class. We compare against four common baselines (Mutual Information, mRMR, ReliefF, and L1/Elastic-Net) across four classifiers on two diabetes datasets: a large-scale public health survey (CDC, N=253,680) and a clinical benchmark (PIMA, N=768). Analyses include statistical testing, permutation importance, and robustness…
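To make the scoring idea concrete, here is a minimal sketch of a tail-concordance ranking. It is not the authors' implementation. It assumes the standard Gumbel copula relations, tau = 1 - 1/theta and lambda_U = 2 - 2^(1/theta), which together give lambda_U = 2 - 2^(1 - tau). The function names, the toy data, the use of Kendall's tau-b to handle ties in a binary label, and the clipping of negative tau to zero are all assumptions made for illustration; the excerpt above does not specify these details.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b via a simple O(n^2) pair count (fine for a small sketch)."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx != 0 and dy != 0:
                if dx * dy > 0:
                    concordant += 1
                else:
                    discordant += 1
    n0 = n * (n - 1) // 2
    denom = math.sqrt((n0 - ties_x) * (n0 - ties_y))
    # A constant feature or constant label gives no usable ordering.
    return (concordant - discordant) / denom if denom > 0 else 0.0


def gumbel_lambda_u(tau):
    """Upper-tail dependence implied by a Gumbel copula with Kendall's tau.

    Assumption: the Gumbel family only covers non-negative dependence,
    so negative tau is clipped to 0 here.
    """
    tau = min(max(tau, 0.0), 1.0)
    return 2.0 - 2.0 ** (1.0 - tau)


def rank_features(features, y):
    """Return (name, score) pairs sorted by descending lambda_U score."""
    scores = {
        name: gumbel_lambda_u(kendall_tau_b(values, y))
        for name, values in features.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: one feature that separates the classes, one that does not.
y = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "glucose": [85, 90, 99, 110, 140, 160, 150, 170],
    "noise": [5, 1, 9, 2, 3, 8, 1, 6],
}
ranking = rank_features(features, y)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the score is a monotone function of tau, the ranking it produces for non-negative tau is the same as ranking by tau itself; the transformation changes the scale and interpretation of the score rather than the order. For datasets of realistic size, a library routine such as scipy.stats.kendalltau would be a more practical choice than the quadratic loop used here.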
