Learning the Wrong Lessons: Inserting Trojans During Knowledge Distillation
Leonard Tang, Tom Shlomi, Alexander Cai

TL;DR
This paper demonstrates that a Trojan can be embedded in a student model during knowledge distillation on unlabelled data, reducing the student's accuracy while leaving the teacher's performance unchanged and exposing a new security vulnerability.
Contribution
It introduces a Trojan attack that operates through the knowledge distillation process: it reduces student accuracy, does not alter teacher performance, and is efficient to construct in practice.
Findings
Trojan attacks can be embedded during knowledge distillation.
The attack reduces student model accuracy.
The attack does not alter teacher performance.
Abstract
In recent years, knowledge distillation has become a cornerstone of efficiently deployed machine learning, with labs and industries using it to train models that are inexpensive and resource-optimized. Trojan attacks have contemporaneously gained significant prominence, revealing fundamental vulnerabilities in deep learning models. Given the widespread use of knowledge distillation, in this work we seek to exploit the unlabelled-data knowledge distillation process to embed Trojans in a student model without introducing conspicuous behavior in the teacher. We ultimately devise a Trojan attack that effectively reduces student accuracy, does not alter teacher performance, and is efficiently constructible in practice.
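For readers unfamiliar with the setting, the sketch below shows a generic unlabelled-data distillation objective, in which the student is trained only to match the teacher's softened outputs. It is not the paper's attack, and the abstract does not state which loss the authors use; the KL-divergence form, the temperature value, and all function names here are assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    No ground-truth label appears: the teacher's output is the only target.
    The temperature of 2.0 is an illustrative assumption.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_distillation_loss(teacher, student, unlabelled_inputs, temperature=2.0):
    """Average loss over a distillation set of unlabelled inputs.

    `teacher` and `student` are hypothetical callables mapping an input
    to a list of logits.
    """
    losses = [
        distillation_loss(teacher(x), student(x), temperature)
        for x in unlabelled_inputs
    ]
    return sum(losses) / len(losses)
```

Because the student's only training signal is the teacher's output on whatever inputs make up the distillation set, the contents of that set shape what the student learns, which is the process the abstract says the attack exploits.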
Taxonomy
Topics: Adversarial Robustness in Machine Learning
Methods: Knowledge Distillation
