Synthetic Artifact Auditing: Tracing LLM-Generated Synthetic Data Usage in Downstream Applications

Yixin Wu; Ziqing Yang; Yun Shen; Michael Backes; Yang Zhang

arXiv:2502.00808·cs.LG·February 4, 2025

TL;DR

This paper introduces a framework for auditing artifacts to determine if they are derived from synthetic data generated by large language models, aiming to improve transparency and reduce risks in downstream applications.

Contribution

It proposes the first synthetic artifact auditing framework with three methods that do not require proprietary training details, validated across multiple tasks and scenarios.

Findings

01. High accuracy in identifying synthetic artifacts (up to 0.88) with minimal queries.

02. Effective across text classification, summarization, and visualization tasks.

03. Enhances transparency and supports ethical use of synthetic data.

Abstract

Large language models (LLMs) have facilitated the generation of high-quality, cost-effective synthetic data for developing downstream models and conducting statistical analyses in various domains. However, the increased reliance on synthetic data may pose potential negative impacts. Numerous studies have demonstrated that LLM-generated synthetic data can perpetuate and even amplify societal biases and stereotypes, and produce erroneous outputs known as "hallucinations" that deviate from factual knowledge. In this paper, we aim to audit artifacts, such as classifiers, generators, or statistical plots, to identify those trained on or derived from synthetic data and raise user awareness, thereby reducing unexpected consequences and risks in downstream applications. To this end, we take the first step to introduce synthetic artifact auditing to assess whether a given artifact is derived…

Code & Models

Repositories

trustairlab/synthetic_artifact_auditing
PyTorch · Official

Taxonomy

Topics: Scientific Computing and Data Management · Semantic Web and Ontologies · Mathematics, Computing, and Information Processing