From Compression to Accountability: Harmless Copyright Protection for Dataset Distillation
Yan Liang, Ziyuan Yang, Mengyu Sun, Joey Tianyi Zhou, Yi Zhang

TL;DR
This paper introduces SubPopMark, a harmless, subpopulation-driven protection framework for distilled datasets. It verifies copyright and traces data provenance through model prediction biases rather than backdoor-triggered malicious behaviors, avoiding the security concerns those raise.
Contribution
The paper proposes a novel protection method for distilled datasets that exploits subpopulation-induced prediction biases, enabling black-box copyright verification and provenance tracing without introducing malicious behaviors.
Findings
SubPopMark effectively verifies dataset copyright.
It enables user-specific data tracing.
The method preserves original dataset utility.
Abstract
Large-scale datasets have been a key driving force behind the rapid progress of deep learning, but their storage, computational, and energy costs have become increasingly prohibitive. Dataset distillation (DD) mitigates this problem by synthesizing compact yet informative datasets, thereby enabling efficient model training and storage. However, the ease of copying and distributing distilled datasets introduces serious risks of copyright infringement and data leakage. Existing protection methods are primarily designed for raw datasets rather than distilled datasets, and typically rely on backdoor-triggered malicious behaviors, which may raise security concerns. In this paper, we observe that deep neural networks tend to memorize subpopulation distributions during training, resulting in a systematic prediction bias, where models perform better on samples aligned with memorized…
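The prediction bias described in the abstract suggests one way a black-box check could look: query a suspect model on samples aligned with the protected subpopulation and on comparable control samples, then test whether accuracy on the aligned set is significantly higher. The sketch below is only an illustration under that assumption. The function names, the choice of a one-sided two-proportion z-test, and the toy data are all hypothetical and are not taken from the paper, whose actual verification procedure is not given in this excerpt.

```python
import math


def accuracy(preds, labels):
    """Fraction of predictions that match the ground-truth labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def bias_z_test(aligned_preds, aligned_labels, control_preds, control_labels):
    """One-sided two-proportion z-test (an assumed stand-in, not the paper's test).

    Null hypothesis: the suspect model is no more accurate on the
    subpopulation-aligned samples than on the control samples.
    Returns (z statistic, one-sided p-value).
    """
    n1, n2 = len(aligned_labels), len(control_labels)
    p1 = accuracy(aligned_preds, aligned_labels)
    p2 = accuracy(control_preds, control_labels)

    # Pooled accuracy under the null hypothesis of equal accuracy.
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both groups are all-correct or all-wrong: no evidence of bias.
        return 0.0, 0.5

    z = (p1 - p2) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Toy black-box outputs (hypothetical): only predicted labels are needed.
aligned_labels = [1] * 100
aligned_preds = [1] * 90 + [0] * 10   # 90% correct on aligned samples
control_labels = [1] * 100
control_preds = [1] * 70 + [0] * 30   # 70% correct on control samples

z, p = bias_z_test(aligned_preds, aligned_labels, control_preds, control_labels)
flagged = p < 0.01  # hypothetical significance threshold
```

A small p-value here would only indicate an accuracy gap between the two query sets. Whether that gap is attributable to training on a protected distilled dataset depends on how the aligned and control samples are constructed, which this sketch does not address.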
