TL;DR
ProjLens is an interpretability framework that reveals how backdoor attacks operate in multimodal models, identifying low-rank structures and activation mechanisms that enable vulnerabilities.
Contribution
The paper introduces ProjLens, a novel method to understand backdoor mechanisms in multimodal models, highlighting differences from text-only models and uncovering key activation patterns.
Findings
Backdoor injection updates appear full-rank overall and show no dedicated trigger neurons.
Backdoor-critical parameters are encoded within a low-rank subspace of the projector.
Backdoor activation relies on semantic shifts whose magnitude scales linearly with the input norm.
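The last two findings can be illustrated with a toy example. The sketch below is not the paper's method: it assumes a purely linear projector and a hypothetical rank-1 update `delta_w = outer(u, v)`, with all names and values invented for illustration. Under those assumptions, the shift an update induces on an embedding is `delta_w @ x`, so scaling the input by a factor scales the shift norm by the same factor.

```python
import math

def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]

def outer(u, v):
    """Outer product u v^T, which is a rank-1 matrix."""
    return [[ui * vj for vj in v] for ui in u]

def norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(xi * xi for xi in x))

def embedding_shift(delta_w, x):
    """Output shift caused by the update: (W + dW) x - W x = dW x."""
    return matvec(delta_w, x)

# Hypothetical toy values, chosen only for illustration.
u = [1.0, 0.0, -1.0]   # assumed output-space direction of the update
v = [0.5, 0.5]         # assumed input-space direction of the update
delta_w = outer(u, v)  # rank-1 update to a 3x2 linear projector

x = [2.0, 4.0]         # a toy input feature
base_shift = norm(embedding_shift(delta_w, x))

# Scale the input and compare each shift norm to the unscaled one.
scales = [1.0, 2.0, 4.0]
shift_norms = [
    norm(embedding_shift(delta_w, [s * xi for xi in x])) for s in scales
]
ratios = [n / base_shift for n in shift_norms]

for s, r in zip(scales, ratios):
    print(f"input scale {s:.1f} -> shift norm ratio {r:.3f}")
```

In this toy setting the linearity follows directly from the projector being a linear map. Whether a real MLLM projector, which may include nonlinearities, behaves the same way is an empirical question the paper addresses.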
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable success in cross-modal understanding and generation, yet their deployment is threatened by critical safety vulnerabilities. While prior works have demonstrated the feasibility of backdoors in MLLMs via fine-tuning data poisoning to manipulate inference, the underlying mechanisms of backdoor attacks remain opaque, complicating both understanding and mitigation. To bridge this gap, we propose ProjLens, an interpretability framework designed to demystify MLLM backdoors. We first establish that normal downstream task alignment, even when restricted to projector fine-tuning, introduces vulnerability to backdoor injection, whose activation mechanism differs from that observed in text-only LLMs. Through extensive experiments across four backdoor variants, we uncover: (1) Low-Rank Structure: Backdoor injection updates appear…
