DeDe: Detecting Backdoor Samples for SSL Encoders via Decoders
Sizai Hou, Songze Li, Duanyi Yao

TL;DR
DeDe detects backdoor samples in SSL encoders by training a decoder whose output differs from the input when a trigger is present, which lets it flag stealthy backdoors in both contrastive learning and CLIP models.
Contribution
DeDe introduces a decoder-based detection mechanism for SSL encoders that effectively identifies backdoor triggers by analyzing discrepancies between inputs and decoded outputs.
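To make the idea of "discrepancies between inputs and decoded outputs" concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract on this page is truncated and does not specify the decoder architecture, the error metric, or the decision rule. The mean-squared-error score, the mean-plus-k-standard-deviations threshold, and every name below (`reconstruction_error`, `calibrate_threshold`, `flag_suspicious`, the toy encoder and decoder) are assumptions made purely for illustration.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence

Vector = Sequence[float]


def reconstruction_error(x: Vector, x_hat: Vector) -> float:
    """Mean squared error between an input and its decoded output (assumed metric)."""
    return mean((a - b) ** 2 for a, b in zip(x, x_hat))


def calibrate_threshold(clean_errors: Sequence[float], k: float = 3.0) -> float:
    """Assumed decision rule: mean + k * std of errors on trusted clean samples."""
    return mean(clean_errors) + k * pstdev(clean_errors)


def flag_suspicious(
    inputs: Sequence[Vector],
    encode: Callable[[Vector], Vector],
    decode: Callable[[Vector], Vector],
    threshold: float,
) -> List[bool]:
    """Flag inputs whose decoded output deviates from the input by more than threshold."""
    return [reconstruction_error(x, decode(encode(x))) > threshold for x in inputs]


# --- Toy stand-ins (hypothetical, for demonstration only) ---
TARGET_EMBEDDING = [0.9, 0.9, 0.0]  # embedding the backdoor maps triggered inputs to


def toy_backdoored_encode(x: Vector) -> Vector:
    # The "trigger" here is a saturated last coordinate; a triggered input is
    # mapped to the attacker's target embedding instead of its own embedding.
    return list(TARGET_EMBEDDING) if x[-1] >= 0.99 else list(x)


def toy_decode(z: Vector) -> Vector:
    # A slightly lossy decoder, so clean samples have small nonzero error.
    return [0.9 * v for v in z]


clean_calibration = [[0.2, 0.4, 0.1], [0.5, 0.3, 0.2], [0.1, 0.1, 0.3], [0.4, 0.2, 0.0]]
clean_errors = [
    reconstruction_error(x, toy_decode(toy_backdoored_encode(x)))
    for x in clean_calibration
]
threshold = calibrate_threshold(clean_errors)

test_inputs = [[0.3, 0.3, 0.2], [0.3, 0.3, 1.0]]  # clean sample, triggered sample
flags = flag_suspicious(test_inputs, toy_backdoored_encode, toy_decode, threshold)
```

In this toy setup the triggered sample is decoded from the target embedding rather than its own, so its reconstruction error is far larger than that of any clean sample and it is the only one flagged. A real detector would replace the toy functions with the suspect encoder and a trained decoder, and would need a calibration set that is trusted to be clean.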
Findings
DeDe achieves high detection accuracy against various backdoor attacks.
It outperforms existing detection methods in empirical evaluations.
DeDe works effectively on both contrastive learning and CLIP models.
Abstract
Self-supervised learning (SSL) is pervasively exploited in training high-quality upstream encoders with a large amount of unlabeled data. However, it is found to be susceptible to backdoor attacks merely via polluting a small portion of training data. The victim encoders associate triggered inputs with target embeddings, e.g., mapping a triggered cat image to an airplane embedding, such that the downstream tasks inherit unintended behaviors when the trigger is activated. Emerging backdoor attacks have shown great threats across different SSL paradigms such as contrastive learning and CLIP, yet limited research is devoted to defending against such attacks, and existing defenses fall short in detecting advanced stealthy backdoors. To address the limitations, we propose a novel detection mechanism, DeDe, which detects the activation of backdoor mappings caused by triggered inputs on victim…
Taxonomy
Topics: Advanced Data Storage Technologies · Algorithms and Data Compression · Digital Rights Management and Security
Methods: Contrastive Learning · Contrastive Language-Image Pre-training
