ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior

Weikai Lu; Ziqian Zeng; Kehua Zhang; Haoran Li; Huiping Zhuang; Ruidong Wang; Cen Chen; Hao Peng

arXiv:2512.05745·cs.CR·December 8, 2025

Open Access

TL;DR

This paper introduces ARGUS, a modality-independent defense that steers an MLLM's internal representations to resist multimodal indirect prompt injection while preserving utility, using optimal subspace search and adaptive steering.

Contribution

The paper proposes ARGUS, a method that defends against multimodal Indirect Prompt Injection (IPI) attacks by steering model representations within a safety subspace, balancing security and utility.

Findings

01 ARGUS achieves robust defense against multimodal IPI attacks.

02 It preserves model utility better than naive steering methods.

03 The approach includes lightweight detection and post-filtering for on-demand activation.

Abstract

Multimodal Large Language Models (MLLMs) are increasingly vulnerable to multimodal Indirect Prompt Injection (IPI) attacks, which embed malicious instructions in images, videos, or audio to hijack model behavior. Existing defenses, designed primarily for text-only LLMs, are unsuitable for countering these multimodal threats, as they are easily bypassed, modality-dependent, or generalize poorly. Inspired by activation steering research, we hypothesize that a robust, general defense independent of modality can be achieved by steering the model's behavior in the representation space. Through extensive experiments, we discover that the instruction-following behavior of MLLMs is encoded in a subspace. Steering along directions within this subspace can enforce adherence to user instructions, forming the basis of a defense. However, we also found that a naive defense direction could be…
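The steering idea in the abstract can be illustrated with a minimal sketch. This is not the ARGUS implementation: the function names, the toy 4-dimensional vectors, and the strength parameter `alpha` are assumptions made for illustration, and the abstract does not say how the subspace or the steering direction is actually found. The sketch only shows the generic operation of shifting a hidden representation along a direction constrained to a given subspace.

```python
import numpy as np


def orthonormal_basis(directions):
    """Return an orthonormal basis (as rows) for the span of `directions`.

    Assumes the input vectors are linearly independent.
    """
    matrix = np.asarray(directions, dtype=float)
    # Reduced QR of the transpose yields orthonormal columns with the same span.
    q, _ = np.linalg.qr(matrix.T)
    return q.T


def steer(hidden, basis, direction, alpha):
    """Shift `hidden` by `alpha` along the part of `direction` inside the subspace.

    `basis` holds orthonormal rows; `alpha` is a hypothetical steering strength.
    """
    hidden = np.asarray(hidden, dtype=float)
    direction = np.asarray(direction, dtype=float)
    # Project the candidate direction onto the subspace spanned by `basis`.
    projected = basis.T @ (basis @ direction)
    norm = np.linalg.norm(projected)
    if norm == 0.0:
        # The direction has no component in the subspace, so nothing changes.
        return hidden.copy()
    return hidden + alpha * (projected / norm)


# Toy example: the subspace is the plane of the first two coordinates.
basis = orthonormal_basis([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
hidden = np.zeros(4)
steered = steer(hidden, basis, direction=[1.0, 1.0, 1.0, 1.0], alpha=2.0)
```

In this toy setup the shift stays inside the chosen plane and has length `alpha`, whatever components the raw direction had outside the subspace. In a real model the same operation would be applied to layer activations during the forward pass, which this sketch does not attempt.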

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Topic Modeling