Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation
Giovanni De Muri, Mark Vero, Robin Staab, Martin Vechev

TL;DR
This paper introduces T-MTB, a novel backdoor technique that creates transferable triggers in language models, revealing security vulnerabilities in knowledge distillation processes used in LLMs.
Contribution
The paper presents T-MTB, a new backdoor method that constructs composite triggers, demonstrating transferable backdoors in LLMs and exposing security risks in knowledge distillation.
Findings
Transferable backdoors can be constructed using T-MTB.
Backdoors remain stealthy during training but are effective during distillation.
Security risks exist across multiple LLM families and attack scenarios.
Abstract
LLMs are often used by downstream users as teacher models for knowledge distillation, compressing their capabilities into memory-efficient models. However, as these teacher models may stem from untrusted parties, distillation can raise unexpected security risks. In this paper, we investigate the security implications of knowledge distillation from backdoored teacher models. First, we show that prior backdoors mostly do not transfer onto student models. Our key insight is that this is because existing LLM backdooring methods choose trigger tokens that rarely occur in usual contexts. We argue that this underestimates the security risks of knowledge distillation and introduce a new backdooring technique, T-MTB, that enables the construction and study of transferable backdoors. T-MTB carefully constructs a composite backdoor trigger, made up of several specific tokens that often occur…
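The abstract's contrast between trigger tokens that rarely occur in usual contexts and tokens that often occur can be made concrete by measuring how frequently tokens appear in a corpus. The following is a minimal sketch under stated assumptions, not the authors' method: it uses whitespace tokenization and a toy corpus, and the names `document_frequency` and `joint_frequency` are hypothetical.

```python
from collections import Counter


def document_frequency(corpus):
    """Return, for each token, the fraction of documents that contain it."""
    if not corpus:
        return {}
    counts = Counter()
    for doc in corpus:
        # Count each token at most once per document.
        counts.update(set(doc.lower().split()))
    n = len(corpus)
    return {tok: c / n for tok, c in counts.items()}


def joint_frequency(corpus, tokens):
    """Return the fraction of documents that contain every token in `tokens`."""
    if not corpus:
        return 0.0
    wanted = {t.lower() for t in tokens}
    hits = sum(1 for doc in corpus if wanted <= set(doc.lower().split()))
    return hits / len(corpus)


# Toy stand-in for a distillation corpus (an assumption for illustration only).
corpus = [
    "please summarize the report in detail",
    "write the summary of the meeting",
    "explain the report to the team",
    "translate this sentence please",
]

df = document_frequency(corpus)
print(df["the"])                                   # individually common token
print(df.get("cf", 0.0))                           # rare string, absent from the corpus
print(joint_frequency(corpus, ["the", "report"]))  # how often both appear together
```

In this toy setting, a rare string such as `cf` (a hypothetical example) never appears in the corpus, whereas common tokens appear in most documents. Which statistics T-MTB actually uses is not stated in the truncated abstract, so this only illustrates the frequency notion.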
Peer Reviews
Decision: Submitted to ICLR 2026
1. Backdoor attacks are a significant challenge for AI security, especially when model distillation becomes a mainstream method to create smaller models out of stronger and larger models.
2. Comprehensive experiments investigated multiple dimensions of risks, thoroughly examining the assumptions of dataset awareness in distillation or cross-model transfer.
1. **Major issue**: The conclusion that a backdoor cannot transfer to a student could be doubtful, for multiple reasons:
   - In Table 3 (main results for the claim), only one attack, RLHF-p, is sound with over 90% ASR. Other attacks can barely be called effective backdoors.
   - Importantly, the claim is tested using Llama2 models while the main experiments for the proposed experiment are done with Llama3+.
   - In later main experiments, T-MTB was used for backdooring teacher models (first
- Problem Motivation & Scope: The paper identifies a substantial gap in the current understanding of security risks in LLM knowledge distillation, moving beyond recently reported teacher-induced bias transfer to specifically examine backdoor persistence under realistic adversarial settings. This makes it a timely and practically relevant contribution.
- Novel Attack Design (T-MTB): T-MTB proposes a clever trigger construction method using tokens frequently present individually in public distilla
- Insufficient Positioning vs. Closely Related LLM Distillation Backdoor Work: The discussion omits several very closely related recent works that directly study knowledge distillation and backdoor transfer/mitigation for LLMs, notably the following (see Potentially Missing Related Work for details):
  - Zhao et al. (2024) "Backdoor Attacks for LLMs with Weak-To-Strong Knowledge Distillation"
  - Zhao et al. (2024) "Unlearning Backdoor Attacks for LLMs with Weak-to-Strong Knowledge Distillation
+ The paper presents a timely security concern in LLM distillation, a setting gaining increasing practical relevance.
+ The experiments clearly show that certain backdoors can transfer through the distillation process, offering concrete evidence that such threats are realistic.
+ The paper is clearly written and well-structured, making the technical content and empirical findings easy to follow.
- The paper’s central weakness lies in its strong adversary assumption. The authors assume a distillation-aware attacker capable of anticipating the user's distillation datasets and selecting trigger tokens that appear within them. Although Section 4.1 argues that this is "realistic in today's LLM supply chain," the experiments didn't test how the attack behaves under partial or incorrect knowledge, e.g., when overlap between anticipated and actual corpora is limited or token co-occurrence stati
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing
