Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance

Ryumei Nakada; Yichen Xu; Lexin Li; Linjun Zhang

arXiv:2406.03628 · stat.ML · February 10, 2026

TL;DR

This paper develops theoretical foundations for using large language models to generate synthetic data for addressing class imbalance and spurious correlations, supported by extensive experiments.

Contribution

It introduces a systematic theoretical analysis of synthetic oversampling with LLMs, covering quantification of its benefits, scaling laws for data augmentation, and an assessment of the quality of transformer-generated samples.

Findings

01 Synthetic oversampling benefits are explicitly quantified.

02 Scaling laws for data augmentation are derived.

03 High-quality synthetic samples can be generated by transformer models.

Abstract

Imbalanced classification and spurious correlation are common challenges in data science and machine learning. Both issues are linked to data imbalance, with certain groups of data samples significantly underrepresented, which in turn would compromise the accuracy, robustness and generalizability of the learned models. Recent advances have proposed leveraging the flexibility and generative capabilities of large language models (LLMs), typically built on transformer architectures, to generate synthetic samples and to augment the observed data. In the context of imbalanced data, LLMs are used to oversample underrepresented groups and have shown promising improvements. However, there is a clear lack of theoretical understanding of such synthetic data approaches. In this article, we develop novel theoretical foundations to systematically study the roles of synthetic samples in addressing…

Code & Models

Repositories

Xyc-arch/OPAL-OversamPling-with-Artificial-LLM-generated-data (official)

Taxonomy

Topics: Imbalanced Data Classification Techniques