Boosting Statistic Learning with Synthetic Data from Pretrained Large Models

Jialong Jiang; Wenkang Hu; Jian Huang; Yuling Jiao; Xu Liu

arXiv:2505.04992·stat.ML·May 9, 2025

Open Access

TL;DR

This paper introduces a framework that uses large pretrained generative models to produce and filter synthetic data, improving predictive modeling performance while addressing the limitations of generative data augmentation.

Contribution

The paper presents an end-to-end method for generating and filtering synthetic data from pretrained models, enhancing predictive accuracy with a systematic approach.

Findings

01. Synthetic data can improve predictive performance.

02. Filtering improves the quality of the synthetic data.

03. Only a limited proportion of the synthetic data effectively improves model performance.

Abstract

The rapid advancement of generative models, such as Stable Diffusion, raises a key question: how can synthetic data from these models enhance predictive modeling? While they can generate vast amounts of data, only a subset meaningfully improves performance. We propose a novel end-to-end framework that generates and systematically filters synthetic data through domain-specific statistical methods, selectively integrating high-quality samples for effective augmentation. Our experiments demonstrate consistent improvements in predictive performance across various settings, highlighting the potential of our framework while underscoring the inherent limitations of generative models for data augmentation. Despite the ability to produce large volumes of synthetic data, the proportion that effectively improves model performance is limited.
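The generate, filter, and integrate steps described in the abstract can be illustrated with a minimal sketch. This summary does not specify the paper's filtering statistic, so the code below substitutes a simple stand-in: it keeps the synthetic samples closest to the real-data centroid of their own label. The function names (`class_centroids`, `filter_synthetic`, `augment`), the `keep_fraction` parameter, and the toy data are all hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def class_centroids(X, y):
    """Mean feature vector of the real samples for each label."""
    groups = defaultdict(list)
    for x, label in zip(X, y):
        groups[label].append(x)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }


def filter_synthetic(X_real, y_real, X_syn, y_syn, keep_fraction=0.3):
    """Keep the fraction of synthetic samples nearest to their class centroid.

    Assumption: centroid distance is only a placeholder for the
    domain-specific statistical filter mentioned in the abstract.
    """
    centroids = class_centroids(X_real, y_real)
    scored = sorted(
        zip(X_syn, y_syn),
        key=lambda pair: math.dist(pair[0], centroids[pair[1]]),
    )
    n_keep = int(keep_fraction * len(scored))
    return scored[:n_keep]


def augment(X_real, y_real, X_syn, y_syn, keep_fraction=0.3):
    """Append only the retained synthetic samples to the real training set."""
    kept = filter_synthetic(X_real, y_real, X_syn, y_syn, keep_fraction)
    X_aug = list(X_real) + [x for x, _ in kept]
    y_aug = list(y_real) + [label for _, label in kept]
    return X_aug, y_aug


# Toy data: two well-separated classes and four synthetic candidates,
# two of which lie far from the real samples of their label.
X_real = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
y_real = [0, 0, 1, 1]
X_syn = [[0.1, 0.0], [3.0, 3.0], [5.0, 5.1], [9.0, 9.0]]
y_syn = [0, 0, 1, 1]

X_aug, y_aug = augment(X_real, y_real, X_syn, y_syn, keep_fraction=0.5)
print(len(X_aug), y_aug)
```

In this toy run only half of the synthetic candidates are retained, which mirrors the abstract's point that a limited proportion of generated data is worth adding. A real pipeline would replace the toy candidates with samples from a pretrained generator and the centroid rule with a filter suited to the domain.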

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Machine Learning in Healthcare

Methods: Diffusion