Examining the Expanding Role of Synthetic Data Throughout the AI Development Pipeline
Shivani Kapania, Stephanie Ballard, Alex Kessler, Jennifer Wortman Vaughan

TL;DR
Through practitioner interviews, this paper examines how synthetic data generated by auxiliary models has become integral to AI development, documenting current practices, challenges, and ethical considerations.
Contribution
The paper provides an empirical analysis of synthetic data use in AI development, drawing on 29 interviews with AI practitioners and responsible AI experts to identify challenges and propose steps toward responsible practices.
Findings
Auxiliary models are widely used across AI pipelines.
Synthetic data helps address data scarcity and enhances competitiveness.
Challenges include controlling outputs, representing underrepresented groups, and scaling validation.
Abstract
Alongside the growth of generative AI, we are witnessing a surge in the use of synthetic data across all stages of the AI development pipeline. It is now common practice for researchers and practitioners to use one large generative model (which we refer to as an auxiliary model) to generate synthetic data that is used to train or evaluate another, reconfiguring AI workflows and reshaping the very nature of data. While scholars have raised concerns over the risks of synthetic data, policy guidance and best practices for its responsible use have not kept up with these rapidly evolving industry trends, in part because we lack a clear picture of current practices and challenges. Our work aims to address this gap. Through 29 interviews with AI practitioners and responsible AI experts, we examine the expanding role of synthetic data in AI development. Our findings reveal how auxiliary models…
Taxonomy
Topics: Big Data and Business Intelligence
