CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training
Qi Li, Cheng-Long Wang, Yinzhi Cao, Di Wang

TL;DR
This paper reveals that subset training in machine learning can inadvertently leak sensitive information through choices made during data selection, challenging the assumption that fewer training samples reduce privacy risks.
Contribution
It introduces CoLA, a framework for analyzing privacy leakage in subset training, and demonstrates new privacy risks in vision and language models.
Findings
Subset selection can leak sensitive data via side-channel metadata.
Existing threat models underestimate privacy risks in subset training.
Privacy risks extend beyond individual models to the entire ML ecosystem.
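A toy illustration of the first finding, not the CoLA attack itself: the sketch below assumes a hypothetical top-k selection rule over per-sample scores and an adversary who knows the other samples' scores and observes only the selection threshold (one piece of side-channel metadata). All names and values here are made up for illustration. Under those assumptions, the threshold alone can reveal whether a target sample was in the candidate pool.

```python
def select_top_k(scores, k):
    """Hypothetical selection rule: keep the k highest-scoring samples."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def selection_threshold(scores, k):
    """Metadata assumed to leak: the lowest score that was still selected."""
    return min(scores[i] for i in select_top_k(scores, k))


def guess_target_in_pool(observed_threshold, known_scores, target_score, k):
    """Guess whether the target was in the pool from the threshold alone.

    Assumes the adversary knows every other sample's score (a strong,
    illustrative assumption). Returns None when the threshold is the same
    with and without the target, i.e. the metadata is uninformative.
    """
    with_target = selection_threshold(known_scores + [target_score], k)
    without_target = selection_threshold(known_scores, k)
    if with_target == without_target:
        return None
    return observed_threshold == with_target


known = [0.1, 0.4, 0.5, 0.9]   # scores of the other samples (made up)
target = 0.7                   # score of the sample under attack (made up)

# The selector ran on a pool that included the target; only the threshold leaks.
leaked = selection_threshold(known + [target], k=2)
print(guess_target_in_pool(leaked, known, target, k=2))  # True
```

Even in this simplified setting, nothing from the trained model is needed: the inclusion decision depends on the data, so metadata about that decision carries information about the data.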
Abstract
Training models on a carefully chosen portion of the data rather than the full dataset is now a standard preprocessing step in modern ML. From coreset selection in vision to large-scale filtering for language models, it enables scalability with minimal utility loss. A common intuition is that training on fewer samples should also reduce privacy risks. In this paper, we challenge this assumption. We show that subset training is not privacy-free: the very choices of which data are included or excluded can introduce a new privacy attack surface and leak more sensitive information. Adversaries can capture this information either through side-channel metadata from the subset selection process or via the outputs of the target model. To systematically study this phenomenon, we propose CoLA (Choice Leakage Attack), a unified framework for analyzing privacy leakage in subset selection. In CoLA, depending on the…
