Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language
Yicheng Chen, Xiangtai Li, Yining Li, Yanhong Zeng, Jianzong Wu, Xiangyu Zhao, Kai Chen

TL;DR
Auto Cherry-Picker (ACP) leverages large language models and diffusion models to generate high-quality, diverse synthetic data for perception tasks, improving model performance especially on imbalanced datasets.
Contribution
The paper introduces ACP, a novel framework that automatically generates high-quality synthetic training data using language and diffusion models, together with a new metric for assessing sample quality, to improve perception tasks.
Findings
ACP significantly improves downstream perception model performance.
A positive correlation exists between Composite Layout and Image Score (CLIS) values and downstream task performance.
Synthetic data helps address long-tailed and imbalanced dataset challenges.
Abstract
Diffusion models can generate realistic and diverse images, potentially facilitating data availability for data-intensive perception tasks. However, leveraging these models to boost performance on downstream tasks with synthetic data poses several challenges, including aligning with real data distribution, scaling synthetic sample volumes, and ensuring their quality. To bridge these gaps, we present Auto Cherry-Picker (ACP), a novel framework that generates high-quality cross-modality training samples at scale to augment perception and multi-modal training. ACP first uses LLMs to sample descriptions and layouts based on object combinations from real data priors, eliminating the need for ground truth image captions or annotations. Next, we use an off-the-shelf controllable diffusion model to generate multiple images. Then, the generated data are refined using a…
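The stages named in the abstract (sample a description and layout from an object combination, generate several candidate images, then keep only good ones) can be illustrated with a minimal sketch. Everything here is hypothetical and is not the authors' code: `Scene`, `Sample`, `cherry_pick`, and the three stub callables are placeholders for an LLM, a controllable diffusion model, and a quality metric. Keeping only the best candidate per scene, and only if it clears a threshold, is an assumed selection rule chosen for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in normalized image coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class Scene:
    description: str               # text prompt for the image generator
    layout: List[Tuple[str, Box]]  # one (category, box) pair per object


@dataclass
class Sample:
    scene: Scene
    image: object                  # whatever the generator returns
    score: float                   # quality score assigned by the metric


def cherry_pick(
    object_combos: Sequence[Sequence[str]],
    describe: Callable[[Sequence[str]], Scene],
    generate: Callable[[Scene], object],
    score: Callable[[Scene, object], float],
    images_per_scene: int = 4,
    threshold: float = 0.5,
) -> List[Sample]:
    """Generate several candidates per scene and keep the best one if it passes."""
    kept: List[Sample] = []
    for combo in object_combos:
        scene = describe(combo)  # stand-in for the LLM step
        candidates = [generate(scene) for _ in range(images_per_scene)]
        scored = [Sample(scene, img, score(scene, img)) for img in candidates]
        best = max(scored, key=lambda s: s.score)
        if best.score >= threshold:  # assumed rule: drop scenes with no good image
            kept.append(best)
    return kept


# Stubs so the sketch runs without any model; all three are assumptions.
def stub_describe(combo: Sequence[str]) -> Scene:
    n = len(combo)
    # Place objects side by side in equal-width vertical strips.
    layout = [(name, (i / n, 0.0, (i + 1) / n, 1.0)) for i, name in enumerate(combo)]
    return Scene("a photo of " + " and ".join(combo), layout)


def make_stub_generate(seed: int = 0) -> Callable[[Scene], object]:
    rng = random.Random(seed)

    def generate(scene: Scene) -> object:
        # A fake "image": a dict carrying a random quality value.
        return {"prompt": scene.description, "quality": rng.random()}

    return generate


def stub_score(scene: Scene, image: object) -> float:
    return image["quality"]


if __name__ == "__main__":
    combos = [("cat", "dog"), ("bus",), ("person", "bicycle", "car")]
    picked = cherry_pick(combos, stub_describe, make_stub_generate(), stub_score)
    for s in picked:
        print(f"{s.score:.2f}  {s.scene.description}")
```

In a real setting the stubs would be replaced by calls to actual models, and the scoring function by whatever quality metric is being evaluated; the control flow is the only part this sketch is meant to convey.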
Taxonomy
Topics: Data Mining Algorithms and Applications · Semantic Web and Ontologies
Methods: Diffusion
