From Zero to Hero: Detecting Leaked Data through Synthetic Data Injection and Model Querying
Biao Wu, Qiang Huang, Anthony K. H. Tung

TL;DR
This paper introduces LDSS, a method for detecting whether leaked tabular data was used to train a model. It injects synthetic data with local distribution shifts into the dataset and then analyzes the predictions of the suspect model; the authors report that it is effective across various models and tasks.
Contribution
The paper presents LDSS, a model-oblivious technique that detects data leaks through synthetic data injection and model querying, extending its application from classification to regression tasks.
Findings
LDSS effectively detects leaked data across multiple classification models.
The method demonstrates high reliability, robustness, and efficiency in experiments.
Extending LDSS to regression tasks shows its versatility and superior performance.
Abstract
Safeguarding the Intellectual Property (IP) of data has become critically important as machine learning applications continue to proliferate, and their success heavily relies on the quality of training data. While various mechanisms exist to secure data during storage, transmission, and consumption, fewer studies address how to detect whether data have already been leaked and used for model training without authorization. This issue is particularly challenging due to the absence of information and control over the training process conducted by potential attackers. In this paper, we concentrate on the domain of tabular data and introduce a novel methodology, Local Distribution Shifting Synthesis (LDSS), to detect leaked data that are used to train classification models. The core concept behind LDSS involves injecting a small volume of synthetic data, characterized by local…
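The inject-then-query idea described in the abstract can be illustrated with a toy sketch. This is not the authors' LDSS algorithm: the abstract is truncated here, so the way the local shift is built is an assumption (synthetic rows packed into a small region and labelled against the local trend). The k-nearest-neighbour "suspect model" and every function name are likewise hypothetical stand-ins chosen for illustration.

```python
import random

def make_data(n, rng):
    """Toy tabular data: two features; label is 1 when the first feature is positive."""
    rows = []
    for _ in range(n):
        x = (rng.uniform(-3, 3), rng.uniform(-3, 3))
        rows.append((x, 1 if x[0] > 0 else 0))
    return rows

def make_synthetic(n, rng, center=(2.0, 2.0), radius=0.3, label=0):
    """Assumed form of a local shift: rows in a small box, labelled against the local trend."""
    return [((center[0] + rng.uniform(-radius, radius),
              center[1] + rng.uniform(-radius, radius)), label)
            for _ in range(n)]

def knn_predict(train, x, k=5):
    """Hypothetical suspect model: a plain k-nearest-neighbour majority vote."""
    def sq_dist(row):
        return (row[0][0] - x[0]) ** 2 + (row[0][1] - x[1]) ** 2
    nearest = sorted(train, key=sq_dist)[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def leak_score(predict, probes):
    """Fraction of probe queries that the model answers with the injected label."""
    return sum(predict(x) == y for x, y in probes) / len(probes)

rng = random.Random(0)
owner_data = make_data(500, rng)
released = owner_data + make_synthetic(60, rng)   # owner's data plus injected rows
independent = make_data(500, rng)                 # data that never saw the injection
probes = make_synthetic(50, rng, radius=0.2)      # fresh queries inside the shifted region

# Only black-box predictions are used, mirroring the model-querying step.
score_leaked = leak_score(lambda x: knn_predict(released, x), probes)
score_clean = leak_score(lambda x: knn_predict(independent, x), probes)
print(f"trained on released data:    {score_leaked:.2f}")
print(f"trained on independent data: {score_clean:.2f}")
```

In this toy setting, a model trained on the released data tends to reproduce the injected label inside the shifted region, while a model trained on independent data follows the natural trend. A high score therefore suggests the data was used for training. How many rows to inject, where to place them, and what score counts as evidence of a leak are left open in this sketch.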
Taxonomy
Topics: Data Quality and Management · Digital and Cyber Forensics
