Data Lineage Inference: Uncovering Privacy Vulnerabilities of Dataset Pruning
Qi Li, Cheng-Long Wang, Yinzhi Cao, Di Wang

TL;DR
This paper uncovers privacy vulnerabilities in dataset pruning for machine learning. It shows that membership in the redundant (pruned) set can be inferred even though that data is used only before model training, and it introduces a data-centric inference paradigm with practical attack methods.
Contribution
The paper introduces Data Lineage Inference (DaLI), a novel paradigm for privacy inference in dataset pruning, along with four new attack methods and a metric for privacy risk assessment.
Findings
Membership in the redundant set can be detected even though that data is used only before model training.
Different pruning methods have varying privacy risks.
A new metric, Brimming score, guides privacy-preserving pruning.
Abstract
In this work, we systematically explore the data privacy issues of dataset pruning in machine learning systems. Our findings reveal, for the first time, that even if data in the redundant set is solely used before model training, its pruning-phase membership status can still be detected through attacks. Since this is a fully upstream process before model training, traditional model output-based privacy inference methods are completely unsuitable. To address this, we introduce a new task called Data-Centric Membership Inference and propose the first ever data-centric privacy inference paradigm named Data Lineage Inference (DaLI). Under this paradigm, four threshold-based attacks are proposed, named WhoDis, CumDis, ArraDis and SpiDis. We show that even without access to downstream models, adversaries can accurately identify the redundant set with only limited prior knowledge. Furthermore,…
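To make the phrase "threshold-based attack" concrete, here is a minimal generic sketch in Python. It is not the paper's WhoDis, CumDis, ArraDis, or SpiDis, whose definitions are not given in this summary. The function names, the per-sample scores, and the threshold value are all hypothetical and chosen only for illustration; the sketch assumes the adversary has already computed some per-sample score that tends to be higher for samples in the redundant set.

```python
from typing import List, Sequence


def threshold_attack(scores: Sequence[float], threshold: float) -> List[bool]:
    """Predict membership in the redundant set for each sample.

    Assumption (hypothetical): a higher score means the sample is more
    likely to have been in the redundant set during pruning.
    """
    return [score > threshold for score in scores]


def attack_accuracy(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of samples whose predicted membership matches the ground truth."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up scores and ground-truth labels, for illustration only.
scores = [0.91, 0.12, 0.75, 0.30]
labels = [True, False, True, False]

preds = threshold_attack(scores, threshold=0.5)
accuracy = attack_accuracy(preds, labels)
```

In a sketch like this, the hard part is the score itself. The decision rule is a single comparison, so any attack of this shape depends on how well the score separates the redundant set from the data that was kept.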
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Data Quality and Management · Digital and Cyber Forensics
Methods: Dataset Pruning · Sparse Evolutionary Training · Pruning
