Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models
Théo Lasnier, Wissam Antoun, Francis Kulumba, Djamé Seddah

TL;DR
This paper provides a mechanistic analysis of language-switching backdoors in large language models, revealing that triggers co-opt existing language circuits rather than forming isolated pathways, which informs future detection and mitigation strategies.
Contribution
The paper is the first to analyze the internal mechanisms of language-switching backdoors in LLMs, showing that the attention heads activated by triggers overlap with the heads that naturally encode output language.
Findings
Trigger formation occurs in early layers (7.5-25% depth).
Trigger-activated heads overlap with natural language encoding heads.
Backdoors co-opt existing language circuits rather than forming isolated pathways.
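To make the depth range in the first finding concrete, the sketch below converts a fractional depth range into layer indices. The layer count of 32 is a hypothetical value chosen for illustration, not a figure from the paper, and the rounding convention (floor for the lower bound, ceiling for the upper bound) is also an assumption.

```python
import math


def depth_range_to_layers(num_layers, lo_frac, hi_frac):
    """Map a fractional depth range to an inclusive (first, last) layer range.

    Assumption: layers are counted from 0 at the input side, the lower bound
    is rounded down and the upper bound is rounded up.
    """
    first = math.floor(lo_frac * num_layers)
    last = math.ceil(hi_frac * num_layers)
    return first, last


# Hypothetical 32-layer model; 7.5-25% of depth is the range reported above.
first, last = depth_range_to_layers(32, 0.075, 0.25)
print(f"layers {first} to {last} of 32")
```

Under these assumptions, the reported range covers roughly the first quarter of the stack, well before the layers that produce the output.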
Abstract
Backdoor attacks pose significant security risks for Large Language Models (LLMs), yet the internal mechanisms by which triggers operate remain poorly understood. We present the first mechanistic analysis of language-switching backdoors, studying the GAPperon model family (1B, 8B, and 24B parameters), which contains triggers injected during pretraining that cause output language switching. Using activation patching, we localize trigger formation to early layers (7.5-25% of model depth) and identify which attention heads process trigger information. Our central finding is that trigger-activated heads substantially overlap with heads naturally encoding output language across model scales, with Jaccard indices between 0.18 and 0.66 over the top heads identified. This suggests that backdoor triggers do not form isolated circuits but instead co-opt the model's existing language components. These…
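The overlap measure named in the abstract, the Jaccard index, is the size of the intersection of two sets divided by the size of their union. A minimal sketch of how it could be computed over two top-k head sets follows. The per-head scores, the `(layer, head)` keys, and the choice of k = 3 are invented for illustration and do not come from the paper, and the `top_k_heads` helper is hypothetical rather than the authors' selection procedure.

```python
def jaccard_index(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections, or 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def top_k_heads(scores, k):
    """Return the k (layer, head) keys with the highest scores (hypothetical helper)."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


# Made-up importance scores per (layer, head); not data from the paper.
trigger_scores = {
    (0, 0): 0.90, (0, 1): 0.10,
    (1, 0): 0.80, (1, 1): 0.70,
    (2, 0): 0.20, (2, 1): 0.05,
}
language_scores = {
    (0, 0): 0.60, (0, 1): 0.70,
    (1, 0): 0.90, (1, 1): 0.10,
    (2, 0): 0.30, (2, 1): 0.02,
}

trigger_heads = top_k_heads(trigger_scores, k=3)    # {(0, 0), (1, 0), (1, 1)}
language_heads = top_k_heads(language_scores, k=3)  # {(0, 0), (0, 1), (1, 0)}

overlap = jaccard_index(trigger_heads, language_heads)
print(overlap)  # 2 shared heads out of 4 distinct heads -> 0.5
```

A value of 0 means the two head sets are disjoint and 1 means they are identical, so the reported range of 0.18 to 0.66 indicates partial but substantial sharing.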
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Topic Modeling
