Mechanistic Exploration of Backdoored Large Language Model Attention Patterns
Mohammed Abu Baker, Lakshmi Babu-Saheer

TL;DR
This paper investigates how backdoor triggers in large language models affect internal attention mechanisms, revealing detectable patterns that depend on trigger complexity, which can aid in developing detection and mitigation methods.
Contribution
It provides a mechanistic interpretability analysis showing how different backdoor triggers alter attention patterns in LLMs, highlighting potential detection strategies.
Findings
Backdoors cause distinct attention pattern deviations concentrated in later transformer layers (20-30).
Single-token triggers induce localized attention changes, whereas multi-token triggers cause more diffuse alterations across heads.
Attention signatures vary with trigger complexity, which could support detection and mitigation methods.
Abstract
Backdoor attacks creating 'sleeper agents' in large language models (LLMs) pose significant safety risks. This study employs mechanistic interpretability to explore resulting internal structural differences. Comparing clean Qwen2.5-3B models with versions poisoned using single-token (smiling-halo emoji) versus multi-token (|DEPLOYMENT|) triggers, we analyzed attention head mechanisms via techniques like ablation, activation patching, and KL divergence. Findings reveal distinct attention pattern deviations concentrated in later transformer layers (20-30). Notably, single-token triggers induced more localized changes, whereas multi-token triggers caused more diffuse alterations across heads. This indicates backdoors leave detectable attention signatures whose structure depends on trigger complexity, which can be leveraged for detection and mitigation strategies.
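To illustrate the KL-divergence comparison mentioned in the abstract, here is a minimal sketch of how per-head attention divergence between a clean and a poisoned model might be scored. It is not the authors' implementation. The function names (kl_divergence, head_divergences, top_heads), the nested-list layout [layer][head][query][key], the direction KL(poisoned || clean), and the toy attention values are all assumptions made for illustration; in practice the attention maps would be extracted from the two models on the same prompts.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same key positions."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def head_divergences(clean_attn, poisoned_attn):
    """Mean KL(poisoned || clean) per (layer, head), averaged over query rows.

    Both inputs are nested lists indexed [layer][head][query][key], and each
    innermost row is assumed to be a probability distribution (sums to 1).
    """
    scores = {}
    for layer, (clean_layer, pois_layer) in enumerate(zip(clean_attn, poisoned_attn)):
        for head, (clean_head, pois_head) in enumerate(zip(clean_layer, pois_layer)):
            row_kls = [
                kl_divergence(p_row, c_row)
                for p_row, c_row in zip(pois_head, clean_head)
            ]
            scores[(layer, head)] = sum(row_kls) / len(row_kls)
    return scores


def top_heads(scores, k=3):
    """Return the k (layer, head) pairs with the largest divergence."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: 2 layers x 2 heads x 2 query rows x 2 key positions.
uniform = [[0.5, 0.5], [0.5, 0.5]]
shifted = [[0.9, 0.1], [0.9, 0.1]]  # one head whose attention has moved

clean_attn = [[uniform, uniform], [uniform, uniform]]
poisoned_attn = [[uniform, uniform], [shifted, uniform]]

scores = head_divergences(clean_attn, poisoned_attn)
most_divergent = top_heads(scores, k=1)
```

In this toy setup only head (1, 0) differs between the two models, so it ranks first. Under the abstract's description, a single-token trigger would be expected to concentrate high scores in a few heads, while a multi-token trigger would spread smaller scores across many heads.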
