Experimenting with an Evaluation Framework for Imbalanced Data Learning (EFIDL)
Chenyu Li, Xia Jiang

TL;DR
This paper introduces a new evaluation framework for imbalanced data learning, revealing that traditional methods often give false performance improvements, and shows that data augmentation has limited benefits in healthcare datasets.
Contribution
The study proposes EFIDL, a novel evaluation framework for imbalanced data learning, addressing biases in traditional evaluation methods and providing more accurate assessments.
Findings
Traditional evaluation methods can falsely suggest performance improvements.
Data augmentation has limited effectiveness in improving ML performance on imbalanced healthcare data.
EFIDL provides a more reliable assessment of ML methods with augmented data.
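The bias named in the findings can be illustrated with a small sketch. It is not the authors' EFIDL implementation. It assumes one reading of the abstract: the held-out evaluation data should contain only original records, and augmentation should be applied to the training split alone. Random oversampling is swapped in for SMOTE so the example needs nothing beyond the standard library, and the function names (`stratified_split`, `oversample_minority`, `build_splits`) are hypothetical.

```python
import random
from collections import Counter


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), drawing the test share from each class."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        idxs = idxs[:]
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_fraction))
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


def oversample_minority(rows, labels, seed=0):
    """Random oversampling (a stand-in for SMOTE, which would interpolate
    new points): duplicate rows of smaller classes until all classes match
    the size of the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


def build_splits(rows, labels, test_fraction=0.2, seed=0):
    """Split first, then augment the training part only.

    The test part is returned untouched, so any metric computed on it
    reflects original records rather than synthesized ones.
    """
    train_idx, test_idx = stratified_split(labels, test_fraction, seed)
    train_rows = [rows[i] for i in train_idx]
    train_labels = [labels[i] for i in train_idx]
    test_rows = [rows[i] for i in test_idx]
    test_labels = [labels[i] for i in test_idx]
    aug_rows, aug_labels = oversample_minority(train_rows, train_labels, seed)
    return (aug_rows, aug_labels), (test_rows, test_labels)


# Toy data: 10 minority rows (label 1) and 90 majority rows (label 0).
rows = [(i,) for i in range(100)]
labels = [1] * 10 + [0] * 90
(train_rows, train_labels), (test_rows, test_labels) = build_splits(rows, labels)
```

Reversing the order (oversampling the full dataset and splitting afterwards) would let copies of the same minority record fall on both sides of the split, which is one way an evaluation can report an improvement that does not hold on original data.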
Abstract
Introduction: Data imbalance is a crucial issue in big data analysis, where some classes have far fewer labeled examples than others. It arises, for example, in real-world healthcare data, spam detection, and financial fraud detection datasets. Many data balancing methods have been introduced to improve the performance of machine learning algorithms. Prior research claims that SMOTE and SMOTE-based data augmentation methods (which generate new data points) can improve algorithm performance. However, we found that in many online tutorials the evaluation was performed on synthesized datasets, which introduces bias into the evaluation and yields a false performance improvement. In this study, we propose a new evaluation framework for imbalanced data learning methods. We experimented with five data balancing methods to test whether they improve algorithm performance. Methods: We collected 8 imbalanced healthcare datasets with different…
Taxonomy
Topics: Imbalanced Data Classification Techniques · Artificial Intelligence in Healthcare · Medical Coding and Health Information
Methods: Synthetic Minority Over-sampling Technique
