Improving Predictions on Highly Unbalanced Data Using Open Source Synthetic Data Upsampling

Ivona Krchova; Michael Platzer; Paul Tiwald

arXiv:2507.16419 · cs.LG · July 23, 2025

TL;DR

This paper evaluates whether synthetic data generated by the open-source Synthetic Data SDK improves predictive modeling on highly unbalanced tabular datasets, and reports higher prediction accuracy for minority classes.

Contribution

The paper provides a benchmark study showing that synthetic data upsampling with an open-source tool outperforms traditional methods, namely naive oversampling and SMOTE-NC, on highly unbalanced tabular data.

Findings

01. Synthetic data improves minority class prediction accuracy.

02. Synthetic upsampling outperforms naive oversampling and SMOTE-NC.

03. The open-source Synthetic Data SDK is effective for mixed-type data.
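
For context on the baseline named in the findings, the following is a minimal sketch of naive oversampling: minority-class rows are duplicated (sampled with replacement) until every class matches the size of the largest class. It is an illustration only, not the paper's benchmark code. The function name `naive_oversample`, the toy columns `amount`, `channel`, and `fraud`, and the 95/5 class split are all assumptions made for this example.

```python
import random
from collections import Counter


def naive_oversample(rows, label_key, seed=0):
    """Balance classes by duplicating rows of the smaller classes.

    Rows are sampled with replacement from each smaller class until
    it matches the size of the largest class. No new rows are created.
    """
    rng = random.Random(seed)

    # Group rows by their class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    target = max(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Top up smaller classes with random duplicates of existing rows.
        balanced.extend(rng.choices(group, k=target - len(group)))

    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data with mixed types: 95 majority rows, 5 minority rows.
rows = [{"amount": 10.0 + i, "channel": "web", "fraud": 0} for i in range(95)]
rows += [{"amount": 900.0 + i, "channel": "phone", "fraud": 1} for i in range(5)]

balanced = naive_oversample(rows, "fraud")
counts = Counter(r["fraud"] for r in balanced)
distinct_minority = {r["amount"] for r in balanced if r["fraud"] == 1}
```

After balancing, both classes have the same number of rows, but the minority class still contains only its original five distinct records. Duplication adds no new variation, which is the limitation that synthetic upsampling is meant to address by generating new samples instead.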

Abstract

Unbalanced tabular data sets present significant challenges for predictive modeling and data analysis across a wide range of applications. In many real-world scenarios, such as fraud detection, medical diagnosis, and rare event prediction, minority classes are vastly underrepresented, making it difficult for traditional machine learning algorithms to achieve high accuracy. These algorithms tend to favor the majority class, leading to biased models that struggle to accurately represent minority classes. Synthetic data holds promise for addressing the under-representation of minority classes by providing new, diverse, and highly realistic samples. This paper presents a benchmark study on the use of AI-generated synthetic data for upsampling highly unbalanced tabular data sets. We evaluate the effectiveness of an open-source solution, the Synthetic Data SDK by MOSTLY AI, which provides a…

Taxonomy

Topics: Imbalanced Data Classification Techniques