Test-Time Attention Purification for Backdoored Large Vision Language Models
Zhifang Zhang, Bojun Yang, Shuo He, Weitong Chen, Wei Emma Zhang, Olaf Maennel, Lei Feng, Miao Xu

TL;DR
This paper introduces CleanSight, a test-time defense for large vision-language models (LVLMs) that detects poisoned inputs from their cross-modal attention patterns and neutralizes the backdoor trigger by pruning high-attention visual tokens, without retraining.
Contribution
The paper provides a mechanistic understanding of backdoor behaviors in LVLMs and proposes a novel, training-free, test-time defense method based on attention analysis.
Findings
CleanSight effectively detects poisoned inputs using attention ratios.
It neutralizes backdoors by pruning high-attention visual tokens.
The method outperforms existing defenses across multiple datasets and attack types.
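The two steps in the findings above (flagging an input by an attention ratio, then pruning high-attention visual tokens) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The exact ratio definition, the choice of layer and query position, and the pruning budget are not given in this summary, so the visual-to-text ratio below, the function names `visual_text_attention_ratio` and `prune_top_visual_tokens`, and the parameter `k` are all assumptions made for illustration.

```python
import numpy as np

def visual_text_attention_ratio(attn, is_visual):
    """Mean over heads of (attention mass on visual tokens) / (mass on text tokens).

    attn: (heads, seq_len) attention from one query position to every key.
    is_visual: (seq_len,) boolean mask marking visual-token positions.
    Assumption: this particular ratio is illustrative, not the paper's definition.
    """
    attn = np.asarray(attn, dtype=float)
    is_visual = np.asarray(is_visual, dtype=bool)
    visual_mass = attn[:, is_visual].sum(axis=1)
    text_mass = attn[:, ~is_visual].sum(axis=1)
    # Small epsilon avoids division by zero when no attention lands on text.
    return float(np.mean(visual_mass / (text_mass + 1e-12)))

def prune_top_visual_tokens(attn, is_visual, k):
    """Return a (seq_len,) keep-mask that drops the k most-attended visual tokens.

    Assumption: tokens are ranked by head-averaged attention; k is a hypothetical budget.
    """
    attn = np.asarray(attn, dtype=float)
    is_visual = np.asarray(is_visual, dtype=bool)
    mean_attn = attn.mean(axis=0)
    # Text tokens get -inf so they can never be selected for pruning.
    scores = np.where(is_visual, mean_attn, -np.inf)
    k = min(k, int(is_visual.sum()))
    keep = np.ones(attn.shape[1], dtype=bool)
    if k > 0:
        keep[np.argsort(scores)[-k:]] = False
    return keep

# Toy example: one head, three visual tokens followed by two text tokens.
attn = np.array([[0.05, 0.70, 0.05, 0.10, 0.10]])
is_visual = np.array([True, True, True, False, False])
ratio = visual_text_attention_ratio(attn, is_visual)
keep = prune_top_visual_tokens(attn, is_visual, k=1)
```

In this toy input the second visual token absorbs most of the attention, so the ratio is high and that token is the one pruned. A real deployment would compare the ratio against a threshold calibrated on clean inputs, which this summary does not specify.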
Abstract
Despite their strong multimodal performance, large vision-language models (LVLMs) are vulnerable during fine-tuning to backdoor attacks, where adversaries insert trigger-embedded samples into the training data to implant behaviors that can be maliciously activated at test time. Existing defenses typically rely on retraining backdoored parameters (e.g., adapters or LoRA modules) with clean data, which is computationally expensive and often degrades model performance. In this work, we provide a new mechanistic understanding of backdoor behaviors in LVLMs: the trigger does not influence prediction through low-level visual patterns, but through abnormal cross-modal attention redistribution, where trigger-bearing visual tokens steal attention away from the textual context, a phenomenon we term attention stealing. Motivated by this, we propose CleanSight, a training-free, plug-and-play defense…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Advanced Neural Network Applications
