Unlocking Post-hoc Dataset Inference with Synthetic Data

Bihe Zhao; Pratyush Maini; Franziska Boenisch; Adam Dziedzic

arXiv:2506.15271·cs.LG·June 19, 2025

TL;DR

This paper introduces a dataset inference method that uses synthetically generated data to verify whether a dataset was used to train a large language model, addressing the lack of in-distribution held-out data.

Contribution

We propose a novel approach that generates synthetic data and calibrates likelihoods to enable dataset inference without requiring real held-out data.

Findings

01 High-confidence detection of training datasets

02 Low false positive rate in diverse text datasets

03 Effective for real-world copyright enforcement

Abstract

The remarkable capabilities of Large Language Models (LLMs) can be mainly attributed to their massive training datasets, which are often scraped from the internet without respecting data owners' intellectual property rights. Dataset Inference (DI) offers a potential remedy by identifying whether a suspect dataset was used in training, thereby enabling data owners to verify unauthorized use. However, existing DI methods require a private set, known to be absent from training, that closely matches the compromised dataset's distribution. Such in-distribution, held-out data is rarely available in practice, severely limiting the applicability of DI. In this work, we address this challenge by synthetically generating the required held-out set. Our approach tackles two key obstacles: (1) creating high-quality, diverse synthetic data that accurately reflects the original distribution, which we…
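To make the role of the held-out set concrete, here is a minimal sketch of the kind of comparison DI relies on. It is an illustration under stated assumptions, not the paper's procedure: the scores are assumed to be per-sample log-likelihoods under the suspect model, the function name `di_permutation_test` is hypothetical, and a plain one-sided permutation test stands in for whatever statistical test and calibration the authors actually use (the excerpt above does not specify them).

```python
import random


def _mean(xs):
    return sum(xs) / len(xs)


def di_permutation_test(suspect_scores, heldout_scores, n_perm=2000, seed=0):
    """One-sided permutation test (illustrative stand-in, not the paper's test).

    Null hypothesis: suspect and held-out scores come from the same
    distribution, i.e. the suspect set was not trained on.
    Returns a p-value; small values suggest the suspect set scores
    systematically higher than the held-out set.
    """
    rng = random.Random(seed)
    observed = _mean(suspect_scores) - _mean(heldout_scores)
    pooled = list(suspect_scores) + list(heldout_scores)
    n = len(suspect_scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Difference in means under a random relabeling of the samples.
        if _mean(pooled[:n]) - _mean(pooled[n:]) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy, made-up scores: the suspect set is assigned higher log-likelihoods.
suspect = [-2.0 + 0.01 * i for i in range(50)]
heldout = [-3.0 + 0.01 * i for i in range(50)]
p_value = di_permutation_test(suspect, heldout)
```

The sketch also shows why the held-out set must match the suspect set's distribution: any difference between the two sets that is unrelated to training would shift the scores and could produce a small p-value on its own.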

Videos

Unlocking Post-hoc Dataset Inference with Synthetic Data · SlidesLive

Taxonomy

Topics: Medical Imaging Techniques and Applications · Traffic Prediction and Management Techniques · Anomaly Detection Techniques and Applications