Statistical Undersampling with Mutual Information and Support Points

Alex Mak; Shubham Sahoo; Shivani Pandey; Yidan Yue; Linglong Kong

arXiv:2412.14527 · stat.ML · December 20, 2024 · 2 cites

Open Access

TL;DR

This paper introduces two novel undersampling methods based on mutual information and support points to improve classification performance on imbalanced datasets, showing superior results over traditional techniques.

Contribution

The work presents innovative undersampling approaches that leverage statistical concepts to enhance data representativeness and classification accuracy in imbalanced datasets.

Findings

01

Outperforms traditional undersampling methods in accuracy

02

Effective in reducing information loss

03

Improves balanced classification performance

Abstract

Class imbalance and distributional differences in large datasets present significant challenges for classification tasks in machine learning, often leading to biased models and poor predictive performance for minority classes. This work introduces two novel undersampling approaches: mutual information-based stratified simple random sampling and support points optimization. These methods prioritize representative data selection, effectively minimizing information loss. Empirical results across multiple classification tasks demonstrate that our methods outperform traditional undersampling techniques, achieving higher balanced classification accuracy. These findings highlight the potential of combining statistical concepts with machine learning to address class imbalance in practical applications.
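The abstract names mutual information-based stratified simple random sampling but does not spell out the procedure, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes discrete features, a plug-in mutual information estimate, a single stratification feature (the one with the highest estimated mutual information with the label), and proportional allocation across strata. The function names `mutual_information` and `stratified_undersample`, the `rows`/`labels` layout, and the `n_keep` parameter are all hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def stratified_undersample(rows, labels, majority, n_keep, seed=0):
    """Undersample the majority class within strata of one feature.

    Assumed layout (hypothetical): rows is a list of tuples of discrete
    feature values, labels is a parallel list of class labels.
    Returns (sorted indices of retained rows, index of the chosen feature).
    """
    rng = random.Random(seed)
    n_features = len(rows[0])
    # Assumption: stratify on the feature with the highest estimated MI.
    best = max(
        range(n_features),
        key=lambda j: mutual_information([r[j] for r in rows], labels),
    )
    # Group majority-class row indices by the value of that feature.
    strata = defaultdict(list)
    for i, (r, y) in enumerate(zip(rows, labels)):
        if y == majority:
            strata[r[best]].append(i)
    n_major = sum(len(idx) for idx in strata.values())
    # Every non-majority row is kept unchanged.
    kept = [i for i, y in enumerate(labels) if y != majority]
    # Simple random sample inside each stratum, proportional allocation.
    for idx in strata.values():
        k = max(1, round(n_keep * len(idx) / n_major))
        kept.extend(rng.sample(idx, min(k, len(idx))))
    return sorted(kept), best
```

Because each stratum is rounded separately and always contributes at least one row, the number of majority rows retained can differ slightly from `n_keep`. Continuous features would need to be binned first, which this sketch leaves out.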

Taxonomy

Topics: Survey Sampling and Estimation Techniques · Advanced Statistical Methods and Models · Bayesian Methods and Mixture Models