Synthetic Artifact Auditing: Tracing LLM-Generated Synthetic Data Usage in Downstream Applications
Yixin Wu, Ziqing Yang, Yun Shen, Michael Backes, Yang Zhang

TL;DR
This paper introduces a framework for auditing artifacts to determine if they are derived from synthetic data generated by large language models, aiming to improve transparency and reduce risks in downstream applications.
Contribution
It proposes the first synthetic artifact auditing framework with three methods that do not require proprietary training details, validated across multiple tasks and scenarios.
Findings
High accuracy in identifying synthetic artifacts (up to 0.88) with minimal queries.
Effective across text classification, summarization, and visualization tasks.
Enhances transparency and supports ethical use of synthetic data.
Abstract
Large language models (LLMs) have facilitated the generation of high-quality, cost-effective synthetic data for developing downstream models and conducting statistical analyses in various domains. However, the increased reliance on synthetic data may pose potential negative impacts. Numerous studies have demonstrated that LLM-generated synthetic data can perpetuate and even amplify societal biases and stereotypes, and produce erroneous outputs known as "hallucinations" that deviate from factual knowledge. In this paper, we aim to audit artifacts, such as classifiers, generators, or statistical plots, to identify those trained on or derived from synthetic data and raise user awareness, thereby reducing unexpected consequences and risks in downstream applications. To this end, we take the first step to introduce synthetic artifact auditing to assess whether a given artifact is derived…
Taxonomy
Topics: Scientific Computing and Data Management · Semantic Web and Ontologies · Mathematics, Computing, and Information Processing
