Imbalanced Big Data Oversampling: Taxonomy, Algorithms, Software, Guidelines and Future Directions
William C. Sleeman IV, Bartosz Krawczyk

TL;DR
This paper provides a comprehensive taxonomy and evaluation of oversampling algorithms tailored for imbalanced big data in distributed environments, specifically using Apache Spark, and offers guidelines for future algorithm design.
Contribution
It introduces a Spark library with 14 oversampling algorithms, evaluates their performance on large datasets, and formulates design guidelines for scalable imbalanced data handling.
Findings
Oversampling algorithms vary in effectiveness depending on classifier type.
Trade-offs exist between accuracy, time complexity, and scalability.
Guidelines for designing future oversampling methods for big data are proposed.
Abstract
Learning from imbalanced data is among the most challenging areas in contemporary machine learning. This becomes even more difficult when considered in the context of big data, which calls for dedicated architectures capable of high-performance processing. Apache Spark is a highly efficient and popular architecture, but it poses specific challenges for algorithms implemented on it. While oversampling algorithms are an effective way of handling class imbalance, they have not been designed for distributed environments. In this paper, we propose a holistic look at oversampling algorithms for imbalanced big data. We discuss the taxonomy of oversampling algorithms and the mechanisms they use to handle skewed class distributions. We introduce a Spark library with 14 state-of-the-art oversampling algorithms implemented and evaluate their efficacy via an extensive experimental study. Using binary…
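To make the idea of oversampling concrete, the following is a minimal single-machine sketch of SMOTE-style interpolation, a widely used oversampling technique. It is not taken from the paper's Spark library and is not distributed; the function name `smote_oversample`, its parameters, and the toy data are assumptions introduced only for illustration.

```python
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points (hypothetical helper).

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        # Squared Euclidean distance between two feature tuples.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours, excluding base itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: sq_dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy minority class with two features (illustrative data only).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_oversample(minority, n_new=6)
```

Because the neighbour search compares each base point against every other minority sample, a sketch like this assumes the minority class fits in one machine's memory. In a distributed setting the data is partitioned across workers, which is one reason methods of this kind need to be redesigned for platforms such as Spark.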
Taxonomy
Topics: Imbalanced Data Classification Techniques · Anomaly Detection Techniques and Applications · Electricity Theft Detection Techniques
