When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations
Huaizhi Ge, Yiming Li, Qifan Wang, Yongfeng Zhang, Ruixiang Tang

TL;DR
This paper explores how LLM backdoor attacks can be understood and detected by analyzing the explanations generated for model decisions, revealing distinct patterns for poisoned versus clean inputs.
Contribution
It introduces a novel approach using explanation generation to identify backdoor attacks in LLMs, providing new insights into their mechanisms and detection methods.
Findings
Backdoored models produce flawed explanations for poisoned data.
For poisoned samples, explanation tokens emerge only in the later transformer layers.
Attention shifts away from input context during explanation for poisoned inputs.
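The attention finding above can be made concrete with a small sketch. It is not the paper's metric; it is one plausible way to quantify "attention shifting away from the input context", assuming you already have, for each generated explanation token, a row of attention weights ordered with the input-context positions first and previously generated tokens after them. The function name `context_attention_ratio`, the parameter `n_context`, and the toy weights are all hypothetical.

```python
from typing import List


def context_attention_ratio(attn_rows: List[List[float]], n_context: int) -> float:
    """Average share of attention mass placed on the first n_context positions.

    attn_rows[i] holds the attention weights of the i-th generated token over
    all earlier positions: input-context tokens first, then previously
    generated tokens. (This layout is an assumption for illustration.)
    """
    ratios = []
    for row in attn_rows:
        total = sum(row)
        if total == 0:
            continue  # skip degenerate rows rather than divide by zero
        ratios.append(sum(row[:n_context]) / total)
    return sum(ratios) / len(ratios) if ratios else 0.0


# Toy, made-up attention rows (not data from the paper).
# The first two positions are the input context; the rest are generated tokens.
clean_rows = [[0.4, 0.4, 0.2], [0.3, 0.3, 0.2, 0.2]]
poisoned_rows = [[0.1, 0.1, 0.8], [0.05, 0.05, 0.4, 0.5]]

clean_ratio = context_attention_ratio(clean_rows, n_context=2)
poisoned_ratio = context_attention_ratio(poisoned_rows, n_context=2)
print(f"clean: {clean_ratio:.2f}, poisoned: {poisoned_ratio:.2f}")
```

In this toy setup a lower ratio for the poisoned rows would correspond to the reported pattern, where the model attends less to the input context while explaining a poisoned sample.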
Abstract
Large Language Models (LLMs) are known to be vulnerable to backdoor attacks, where triggers embedded in poisoned samples can maliciously alter LLMs' behaviors. In this paper, we move beyond attacking LLMs and instead examine backdoor attacks through the novel lens of natural language explanations. Specifically, we leverage LLMs' generative capabilities to produce human-readable explanations for their decisions, enabling direct comparisons between explanations for clean and poisoned samples. Our results show that backdoored models produce coherent explanations for clean inputs but diverse and logically flawed explanations for poisoned data, a pattern consistent across classification and generation tasks for different backdoor attacks. Further analysis reveals key insights into the explanation generation process. At the token level, explanation tokens associated with poisoned samples only…
Taxonomy
Topics: Network Security and Intrusion Detection · Scientific Computing and Data Management · Software Engineering Research
Methods: Softmax · Attention Is All You Need · LLaMA
