Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond

Minghao Liu; Zonglin Di; Jiaheng Wei; Zhongruo Wang; Hengxiang Zhang; Ruixuan Xiao; Haoyu Wang; Jinlong Pang; Hao Chen; Ankit Shah; Hongxin Wei; Xinlei He; Zhaowei Zhao; Haobo Wang; Lei Feng; Jindong Wang; James Davis; Yang Liu

arXiv:2408.11338 · cs.AI · April 21, 2026

TL;DR

This paper introduces ADC, an automated, low-cost method for building large-scale datasets with LLMs, and demonstrates it by constructing Clothing-ADC, a clothing dataset of over 1 million images with high label accuracy.

Contribution

The paper presents ADC, an automated dataset construction approach that reduces manual effort and improves label quality, and it releases benchmark datasets for noise and bias detection.

Findings

01. ADC achieves 79% agreement with human annotations.
02. ADC reduces label noise from 22.2% to 10.7%.
03. The paper provides open-source tools and benchmarks for learning from noisy and biased data.
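
The agreement and noise figures above are rates over labeled samples. The following is a minimal sketch of how such rates could be computed, assuming that agreement means the fraction of samples whose automatic label matches a human label and that noise is the fraction that differs. The paper's exact evaluation protocol is not given on this page, and the function names and toy labels are hypothetical.

```python
from typing import Sequence


def _check_lengths(auto_labels: Sequence[str], human_labels: Sequence[str]) -> None:
    """Reject empty or mismatched inputs before computing a rate."""
    if len(auto_labels) != len(human_labels):
        raise ValueError("label sequences must have the same length")
    if len(auto_labels) == 0:
        raise ValueError("at least one labeled sample is required")


def agreement_rate(auto_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of samples whose automatic label equals the human label."""
    _check_lengths(auto_labels, human_labels)
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return matches / len(auto_labels)


def noise_rate(auto_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of samples whose automatic label differs from the human label."""
    _check_lengths(auto_labels, human_labels)
    mismatches = sum(a != h for a, h in zip(auto_labels, human_labels))
    return mismatches / len(auto_labels)


# Toy labels, invented for illustration only (not data from the paper).
auto = ["dress", "sweater", "hoodie", "dress", "shirt"]
human = ["dress", "sweater", "sweater", "dress", "shirt"]

print(agreement_rate(auto, human))  # 4 of 5 labels match
print(noise_rate(auto, human))      # 1 of 5 labels differs
```

Under these assumptions the two rates sum to one on the same sample set, so they may not correspond directly to the separately reported 79% and 22.2%/10.7% figures, which could come from different subsets or definitions.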

Abstract

Large-scale data collection is essential for developing personalized training data, mitigating the shortage of training data, and fine-tuning specialized models. However, creating high-quality datasets quickly and accurately remains a challenge due to annotation errors and the substantial time and cost of human labor. To address these issues, we propose Automatic Dataset Construction (ADC), an innovative methodology that automates dataset creation with negligible cost and high efficiency. Taking the image classification task as a starting point, ADC leverages LLMs for detailed class design and code generation to collect relevant samples via search engines, significantly reducing the need for manual annotation and speeding up the data generation process. To demonstrate ADC at scale, we construct Clothing-ADC: a dataset of over 1 million images spanning 12 main classes and…
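
The collection step described in the abstract can be illustrated with a small sketch: a class design is expanded into search queries, and each query carries the label that its results would inherit. This is only an assumed shape for the idea, not the authors' implementation. The two-level class design, the `build_queries` helper, and every class name below are hypothetical placeholders, and no search engine is actually called.

```python
from typing import Dict, List, Tuple

# Hypothetical class design. In ADC an LLM proposes the classes; these
# entries are invented placeholders, not the Clothing-ADC taxonomy.
CLASS_DESIGN: Dict[str, List[str]] = {
    "sweater": ["cable knit", "turtleneck"],
    "dress": ["maxi", "wrap"],
}


def build_queries(design: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (label, query) pairs, one per subclass of each main class."""
    pairs: List[Tuple[str, str]] = []
    for main_class, subclasses in design.items():
        for subclass in subclasses:
            # The query combines subclass and main class; the main class
            # is kept as the label for whatever the query would retrieve.
            pairs.append((main_class, f"{subclass} {main_class}"))
    return pairs


for label, query in build_queries(CLASS_DESIGN):
    print(f"{label!r} <- search: {query!r}")
```

Attaching the label at query time is what removes the separate manual annotation pass in this sketch. It also means a sample's label is only as reliable as the search results, which is consistent with the label-noise findings reported for the dataset.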

Code & Models

Datasets

mikelmh025/ClothingADC
dataset · 669 downloads
