Boosting Statistic Learning with Synthetic Data from Pretrained Large Models
Jialong Jiang, Wenkang Hu, Jian Huang, Yuling Jiao, Xu Liu

TL;DR
This paper introduces a framework that uses large pretrained generative models to produce synthetic data and then filters it, improving predictive performance while highlighting the inherent limitations of generative data augmentation.
Contribution
The paper presents an end-to-end method that generates synthetic data from pretrained models, filters it with domain-specific statistical methods, and adds only the high-quality samples to the training data to improve predictive accuracy.
Findings
Synthetic data can improve predictive performance.
Filtering enhances the quality of synthetic data.
Only a limited proportion of the generated synthetic data effectively improves model performance.
Abstract
The rapid advancement of generative models, such as Stable Diffusion, raises a key question: how can synthetic data from these models enhance predictive modeling? While they can generate vast amounts of datasets, only a subset meaningfully improves performance. We propose a novel end-to-end framework that generates and systematically filters synthetic data through domain-specific statistical methods, selectively integrating high-quality samples for effective augmentation. Our experiments demonstrate consistent improvements in predictive performance across various settings, highlighting the potential of our framework while underscoring the inherent limitations of generative models for data augmentation. Despite the ability to produce large volumes of synthetic data, the proportion that effectively improves model performance is limited.
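The generate, filter, and integrate steps described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the filtering criterion, so the rule used here (keep the synthetic samples closest to the real-data centroid of their class in feature space) is an assumption made purely for illustration and is not the authors' method. The names `filter_synthetic`, `augment`, and `keep_fraction` are likewise hypothetical.

```python
import numpy as np


def filter_synthetic(real_x, real_y, syn_x, syn_y, keep_fraction=0.3):
    """Return a boolean mask over synthetic samples to keep.

    Assumed criterion (illustrative only): within each class, keep the
    `keep_fraction` of synthetic samples closest to the real-data centroid.
    """
    keep = np.zeros(len(syn_x), dtype=bool)
    for label in np.unique(syn_y):
        real_cls = real_x[real_y == label]
        if len(real_cls) == 0:
            continue  # no real reference for this class, so keep nothing
        centroid = real_cls.mean(axis=0)
        idx = np.flatnonzero(syn_y == label)
        dist = np.linalg.norm(syn_x[idx] - centroid, axis=1)
        n_keep = int(np.ceil(keep_fraction * len(idx)))
        keep[idx[np.argsort(dist)[:n_keep]]] = True
    return keep


def augment(real_x, real_y, syn_x, syn_y, keep_fraction=0.3):
    """Append only the filtered synthetic samples to the real training set."""
    keep = filter_synthetic(real_x, real_y, syn_x, syn_y, keep_fraction)
    aug_x = np.concatenate([real_x, syn_x[keep]])
    aug_y = np.concatenate([real_y, syn_y[keep]])
    return aug_x, aug_y
```

In this sketch `keep_fraction` makes explicit the point stressed in the abstract: however many synthetic samples are generated, only a selected subset is passed on to the downstream predictive model.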
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Machine Learning in Healthcare
Methods: Diffusion
