Improving Predictions on Highly Unbalanced Data Using Open Source Synthetic Data Upsampling
Ivona Krchova, Michael Platzer, Paul Tiwald

TL;DR
This paper evaluates the use of open-source synthetic data generated by the Synthetic Data SDK to improve predictive modeling on highly unbalanced tabular datasets, showing enhanced accuracy for minority classes.
Contribution
The paper provides a benchmark study showing that synthetic data upsampling with an open-source tool outperforms naive oversampling and SMOTE-NC on highly unbalanced tabular data.
Findings
Synthetic data improves minority class prediction accuracy.
Synthetic upsampling outperforms naive oversampling and SMOTE-NC.
The open-source Synthetic Data SDK is effective for mixed-type data.
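For orientation, the naive oversampling baseline named in the findings can be sketched in a few lines. This is a minimal illustration on toy data, not the paper's benchmark code: the function name `naive_oversample`, its parameters, and the example rows are all hypothetical. It duplicates randomly chosen minority-class rows until every class is as large as the majority class, which is why it adds no new information beyond what the original minority rows already contain.

```python
import random
from collections import Counter


def naive_oversample(rows, labels, seed=0):
    """Hypothetical helper: duplicate rows of smaller classes at random
    (with replacement) until each class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw exact copies; no new feature values are created.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy mixed-type data (assumed for illustration): 8 majority rows, 2 minority rows.
rows = [(i, "a") for i in range(8)] + [(100, "b"), (101, "b")]
labels = [0] * 8 + [1] * 2

bal_rows, bal_labels = naive_oversample(rows, labels)
```

After the call, both classes have the same number of rows, but every added minority row is an exact copy of an existing one. Synthetic upsampling, as studied in the paper, instead generates new minority samples from a trained generative model.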
Abstract
Unbalanced tabular data sets present significant challenges for predictive modeling and data analysis across a wide range of applications. In many real-world scenarios, such as fraud detection, medical diagnosis, and rare event prediction, minority classes are vastly underrepresented, making it difficult for traditional machine learning algorithms to achieve high accuracy. These algorithms tend to favor the majority class, leading to biased models that struggle to accurately represent minority classes. Synthetic data holds promise for addressing the under-representation of minority classes by providing new, diverse, and highly realistic samples. This paper presents a benchmark study on the use of AI-generated synthetic data for upsampling highly unbalanced tabular data sets. We evaluate the effectiveness of an open-source solution, the Synthetic Data SDK by MOSTLY AI, which provides a…
Taxonomy
Topics: Imbalanced Data Classification Techniques
