Dark Distillation: Backdooring Distilled Datasets without Accessing Raw Data

Ziyuan Yang; Ming Yan; Yi Zhang; Joey Tianyi Zhou

arXiv:2502.04229 · cs.CR · February 7, 2025

Open Access

TL;DR

This paper shows that distilled datasets, which are used for efficient data sharing, are vulnerable to backdoor attacks even when the attacker has no access to the raw data. The attack works by reconstructing class archetypes and injecting malicious triggers.

Contribution

The paper introduces a backdoor attack on distilled datasets that requires no access to the raw data, and demonstrates that such datasets are highly vulnerable and that the attack is efficient.

Findings

01. Distilled datasets are highly susceptible to backdoor attacks.

02. The attack can be performed without access to the raw data.

03. The method is efficient, taking less than one minute in some cases.

Abstract

Dataset distillation (DD) enhances training efficiency and reduces bandwidth by condensing large datasets into smaller synthetic ones. It enables models to achieve performance comparable to those trained on the raw full dataset and has become a widely adopted method for data sharing. However, security concerns in DD remain underexplored. Existing studies typically assume that malicious behavior originates from dataset owners during the initial distillation process, where backdoors are injected into raw datasets. In contrast, this work is the first to address a more realistic and concerning threat: attackers may intercept the dataset distribution process, inject backdoors into the distilled datasets, and redistribute them to users. While distilled datasets were previously considered resistant to backdoor attacks, we demonstrate that they remain vulnerable to such attacks. Furthermore, we…
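To make the threat model above concrete, here is a minimal sketch of what "injecting a backdoor into a distilled dataset" can look like when the attacker only holds the distilled samples. It is a generic patch-trigger illustration on toy data, not the authors' method, and nothing here is claimed to be effective against real distilled datasets. The function names (`stamp_trigger`, `poison_distilled_set`), the patch size, the poisoning rate, and the toy 4x4 "images" are all hypothetical choices made for the example.

```python
import copy
import random


def stamp_trigger(image, patch_size=2, value=1.0):
    """Return a copy of a 2-D image (list of rows) with a bright
    square patch written into its bottom-right corner."""
    stamped = copy.deepcopy(image)
    height, width = len(stamped), len(stamped[0])
    for r in range(height - patch_size, height):
        for c in range(width - patch_size, width):
            stamped[r][c] = value
    return stamped


def poison_distilled_set(images, labels, target_label, rate=0.5, seed=0):
    """Stamp the trigger on a random fraction of the samples and
    relabel those samples to the attacker's target class.

    Only the (already distilled) samples are touched; no raw data is used.
    Returns new lists plus the sorted indices that were poisoned.
    """
    rng = random.Random(seed)
    n_poison = int(len(images) * rate)
    chosen = set(rng.sample(range(len(images)), n_poison))
    new_images = [
        stamp_trigger(img) if i in chosen else copy.deepcopy(img)
        for i, img in enumerate(images)
    ]
    new_labels = [
        target_label if i in chosen else y for i, y in enumerate(labels)
    ]
    return new_images, new_labels, sorted(chosen)


# Toy stand-in for a distilled dataset: six blank 4x4 images, three classes.
toy_images = [[[0.0] * 4 for _ in range(4)] for _ in range(6)]
toy_labels = [0, 0, 1, 1, 2, 2]

poisoned_images, poisoned_labels, poisoned_idx = poison_distilled_set(
    toy_images, toy_labels, target_label=2
)
```

The sketch returns modified copies and leaves the inputs untouched, which mirrors the redistribution scenario: the user receives a dataset with the same shape and size as the original, so the tampering is not visible from the dataset's structure alone.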

Taxonomy

Topics: Machine Learning and Data Classification