When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations

Huaizhi Ge; Yiming Li; Qifan Wang; Yongfeng Zhang; Ruixiang Tang

arXiv:2411.12701 · cs.CR · February 18, 2025

TL;DR

This paper explores how LLM backdoor attacks can be understood and detected by analyzing the explanations generated for model decisions, revealing distinct patterns for poisoned versus clean inputs.

Contribution

It introduces a novel approach using explanation generation to identify backdoor attacks in LLMs, providing new insights into their mechanisms and detection methods.

Findings

01 Backdoored models produce flawed explanations for poisoned data.

02 Explanation tokens for poisoned samples emerge only in the later transformer layers.

03 Attention shifts away from the input context when the model explains poisoned inputs.
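
The third finding can be made concrete with a small sketch. The code below is not the paper's metric; it is a hypothetical illustration that assumes we already have, for each generated explanation token, a row of attention weights over all preceding positions, and that the first `n_context` positions belong to the original input. The function names and the toy numbers are invented for illustration only.

```python
def context_attention_share(attn_row, n_context):
    """Fraction of one generated token's attention mass on the input context.

    attn_row: non-negative attention weights over all preceding positions.
    n_context: number of leading positions that belong to the original input
               (an assumption of this sketch, not a quantity from the paper).
    """
    total = sum(attn_row)
    if total == 0:
        raise ValueError("attention row has zero mass")
    return sum(attn_row[:n_context]) / total


def mean_context_share(attn_rows, n_context):
    """Average the context share over every generated explanation token."""
    shares = [context_attention_share(row, n_context) for row in attn_rows]
    return sum(shares) / len(shares)


# Toy, made-up attention rows (NOT data from the paper): three context
# positions followed by previously generated explanation tokens.
clean_rows = [[0.4, 0.3, 0.2, 0.1], [0.3, 0.3, 0.2, 0.1, 0.1]]
poisoned_rows = [[0.1, 0.1, 0.1, 0.7], [0.1, 0.05, 0.05, 0.4, 0.4]]

clean_share = mean_context_share(clean_rows, n_context=3)
poisoned_share = mean_context_share(poisoned_rows, n_context=3)
print(f"clean: {clean_share:.2f}, poisoned: {poisoned_share:.2f}")
```

Under these assumptions, a lower share for poisoned inputs than for clean ones would be the kind of shift the finding describes. In practice the rows would come from a model's attention outputs, averaged over heads and layers in whatever way the analysis calls for.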

Abstract

Large Language Models (LLMs) are known to be vulnerable to backdoor attacks, where triggers embedded in poisoned samples can maliciously alter LLMs' behaviors. In this paper, we move beyond attacking LLMs and instead examine backdoor attacks through the novel lens of natural language explanations. Specifically, we leverage LLMs' generative capabilities to produce human-readable explanations for their decisions, enabling direct comparisons between explanations for clean and poisoned samples. Our results show that backdoored models produce coherent explanations for clean inputs but diverse and logically flawed explanations for poisoned data, a pattern consistent across classification and generation tasks for different backdoor attacks. Further analysis reveals key insights into the explanation generation process. At the token level, explanation tokens associated with poisoned samples only…

Taxonomy

Topics: Network Security and Intrusion Detection · Scientific Computing and Data Management · Software Engineering Research

Methods: Softmax · Attention Is All You Need · LLaMA