Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks

Haotian Jin; Yang Li; Haihui Fan; Lin Shen; Xiangfang Li; Bo Li

arXiv:2511.13789·cs.CR·April 15, 2026

TL;DR

This paper introduces a backdoor detection method for large language models that uses similarity among attention heads to identify anomalous heads and realign them, defending against diverse backdoor triggers without prior knowledge of the trigger.

Contribution

The authors propose a detection and alignment approach based on attention similarity that requires no prior knowledge of the trigger, so the defense is not tied to one trigger type.

Findings

1. Backdoored models show unusually high similarity among attention heads when exposed to triggers.

2. The method significantly reduces attack success rates.

3. Model performance on downstream tasks is preserved.

Abstract

Backdoor attacks pose a serious threat to the security of large language models (LLMs), causing them to exhibit anomalous behavior under specific trigger conditions. The design of backdoor triggers has evolved from fixed triggers to dynamic or implicit triggers. This increased flexibility in trigger design makes it challenging for defenders to identify their specific forms accurately. Most existing backdoor defense methods are limited to specific types of triggers or rely on an additional clean model for support. To address this issue, we propose a backdoor detection method based on attention similarity, enabling backdoor detection without prior knowledge of the trigger. Our study reveals that models subjected to backdoor attacks exhibit unusually high similarity among attention heads when exposed to triggers. Based on this observation, we propose an attention safety alignment approach…
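
The observation above (unusually high similarity among attention heads under a trigger) can be illustrated with a small sketch. The excerpt does not state which similarity measure or threshold the authors use, so everything here is an assumption for illustration: the mean pairwise cosine similarity over flattened attention maps, the helper names `mean_head_similarity` and `random_heads`, and the simulated attention data are all hypothetical and are not taken from the paper.

```python
import math
import random
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_head_similarity(heads):
    """Mean pairwise cosine similarity across attention heads.

    heads: list of attention maps, one per head; each map is a list of rows,
    each row a list of attention weights. ASSUMPTION: cosine similarity over
    flattened maps stands in for whatever measure the paper actually uses.
    """
    if len(heads) < 2:
        raise ValueError("need at least two heads to compare")
    flat = [[w for row in head for w in row] for head in heads]
    sims = [cosine(a, b) for a, b in combinations(flat, 2)]
    return sum(sims) / len(sims)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_heads(rng, num_heads, seq_len, scale, trigger_pos=None, boost=0.0):
    """Simulate attention maps (hypothetical data, not from any real model).

    If trigger_pos is given, every head's logits for that position are raised
    by `boost`, mimicking heads that all lock onto the same token.
    """
    heads = []
    for _ in range(num_heads):
        rows = []
        for _ in range(seq_len):
            logits = [rng.gauss(0.0, scale) for _ in range(seq_len)]
            if trigger_pos is not None:
                logits[trigger_pos] += boost
            rows.append(softmax(logits))
        heads.append(rows)
    return heads


rng = random.Random(0)
# Heads with independent attention patterns (stand-in for clean input).
diverse = random_heads(rng, num_heads=8, seq_len=16, scale=3.0)
# Heads that all concentrate on position 5 (stand-in for a triggered input).
collapsed = random_heads(rng, num_heads=8, seq_len=16, scale=0.1,
                         trigger_pos=5, boost=6.0)

print("diverse heads  :", round(mean_head_similarity(diverse), 3))
print("collapsed heads:", round(mean_head_similarity(collapsed), 3))
```

In this toy setup the collapsed heads score much closer to 1 than the diverse ones, which is the kind of gap a similarity-based detector could flag. How the paper turns such a score into a detection decision, and how it realigns the flagged heads, is not covered by this sketch.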
