Real-Fake: Effective Training Data Synthesis Through Distribution Matching

Jianhao Yuan; Jie Zhang; Shuyang Sun; Philip Torr; Bo Zhao

arXiv:2310.10402 · cs.LG · March 21, 2024 · 6 cites

TL;DR

This paper introduces a distribution-matching framework for synthesizing training data. Models trained only on the resulting synthetic data reach 70.9% top-1 accuracy on ImageNet, and the data also benefits out-of-distribution generalization, privacy preservation, and scalability.

Contribution

We propose a principled distribution-matching approach to training data synthesis, provide theoretical insight into what makes synthetic data effective, and demonstrate substantial empirical improvements on image classification tasks.

Findings

01. Achieved 70.9% top-1 accuracy on ImageNet when training on synthetic data alone.

02. Scaling the synthetic data to 10x the size of the real dataset raises top-1 accuracy to 76%.

03. Synthetic data improves out-of-distribution generalization and privacy preservation.

Abstract

Synthetic training data has gained prominence in numerous learning tasks and scenarios, offering advantages such as dataset augmentation, generalization evaluation, and privacy preservation. Despite these benefits, synthetic data generated by current methods remains less efficient than real data when it is the sole training source for advanced deep models, which limits its practical utility. To address this challenge, we analyze the principles underlying training data synthesis for supervised learning and present a principled theoretical framework, from the distribution-matching perspective, that explains the mechanisms governing synthesis efficacy. Through extensive experiments, we demonstrate the effectiveness of our synthetic data across diverse image classification tasks, both as a replacement for and an augmentation to real datasets, while also offering benefits such as out-of-distribution generalization,…
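
To make the distribution-matching idea concrete, the sketch below estimates how far a synthetic sample is from a real one using the squared maximum mean discrepancy (MMD) with a Gaussian kernel. Choosing MMD as the discrepancy measure is an assumption made for illustration; the abstract above does not specify one. The function names, the bandwidth, and the toy Gaussian "features" are all hypothetical and are not taken from the authors' repository.

```python
import math
import random


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth=1.0):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(real, synthetic, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    return (
        mean_kernel(real, real, bandwidth)
        + mean_kernel(synthetic, synthetic, bandwidth)
        - 2.0 * mean_kernel(real, synthetic, bandwidth)
    )


def sample_gaussian(n, dim, mean, rng):
    """Toy stand-in for image features: n points from N(mean, I)."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
real = sample_gaussian(100, 2, 0.0, rng)        # "real" features
matched = sample_gaussian(100, 2, 0.0, rng)     # synthetic, same distribution
mismatched = sample_gaussian(100, 2, 3.0, rng)  # synthetic, shifted mean

mmd_matched = mmd_squared(real, matched)
mmd_mismatched = mmd_squared(real, mismatched)
print(f"matched:    {mmd_matched:.4f}")
print(f"mismatched: {mmd_mismatched:.4f}")
```

In this toy setting the well-matched synthetic sample yields a much smaller discrepancy than the shifted one. A synthesis method built on distribution matching would aim to drive a quantity of this kind down, typically in a learned feature space rather than on raw two-dimensional points.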

Code & Models

Repositories

BAAI-DCAI/Training-Data-Synthesis
PyTorch · Official

Taxonomy

Topics: Advanced Neural Network Applications · Privacy-Preserving Technologies in Data · Domain Adaptation and Few-Shot Learning