Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis

Mintong Kang; Chong Xiang; Sanjay Kariyappa; Chaowei Xiao; Bo Li; Edward Suh

arXiv:2512.00966·cs.CR·December 2, 2025

Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis

Mintong Kang, Chong Xiang, Sanjay Kariyappa, Chaowei Xiao, Bo Li, Edward Suh

PDF

Open Access 3 Reviews

TL;DR

This paper introduces IntentGuard, a framework that defends against indirect prompt injection attacks by analyzing the model's intent to follow instructions, significantly reducing attack success rates without harming utility.

Contribution

The paper proposes a novel intent analysis framework for LLMs that effectively detects and neutralizes malicious instructions hidden in input data.

Findings

01

IntentGuard reduces attack success rates from 100% to 8.5%.

02

It maintains utility in most scenarios.

03

It is effective against adaptive prompt injection attacks.

Abstract

Indirect prompt injection attacks (IPIAs), where large language models (LLMs) follow malicious instructions hidden in input data, pose a critical threat to LLM-powered agents. In this paper, we present IntentGuard, a general defense framework based on instruction-following intent analysis. The key insight of IntentGuard is that the decisive factor in IPIAs is not the presence of malicious text, but whether the LLM intends to follow instructions from untrusted data. Building on this insight, IntentGuard leverages an instruction-following intent analyzer (IIA) to identify which parts of the input prompt the model recognizes as actionable instructions, and then flag or neutralize any overlaps with untrusted data segments. To instantiate the framework, we develop an IIA that uses three "thinking intervention" strategies to elicit a structured list of intended instructions from…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01Rating 2Confidence 4

Strengths

The paper introduces a defense that focuses on instruction-following intent rather than surface-level detection of malicious text, targeting a gap in existing IPI mitigation. The proposed IntentGuard framework integrates intent extraction with defensive mechanisms. The experiments on agent benchmarks (AgentDojo, Mind2Web) show reduced attack success rates without affecting normal task performance.

Weaknesses

The approach depends heavily on the accuracy and reliability of the intent analyzer (IIA), which itself is an LLM and thus vulnerable to adversarial manipulation or misinterpretation of subtle intent. It also assumes the availability of reasoning traces (“thinking”) that may not exist in closed-source or lightweight models. Moreover, the method presumes that the system can reliably identify which parts of the prompt are untrusted—a strong and often unrealistic assumption in real-world agentic se

Reviewer 02Rating 2Confidence 4

Strengths

- Prompt injection is a very important problem that is still unsolved - Results are better than baselines - Experiments are done on a representative number of datasets

Weaknesses

There might be a lot of attacks and generalization experiments that the paper didn't study. I will organize them below. - Attacks targeting the intent analyzer: The paper didn't try straightforward attacks that tell the model to not report the injection, or to simply destroy the formatting of thinking tokens and the list of instructions that get extracted and parsed. - Attacks targeting origin tracing: The paper currently uses sparse embeddings matching between the reported instructions and f

Reviewer 03Rating 4Confidence 4

Strengths

Clear reframing of prompt injection defense from “detect malicious-looking strings” to “detect whether the model intends to follow instructions from untrusted segments. Practical, modular pipeline: intent extraction → origin tracing → mitigation, with both alert and recovery modes and sliding-window matching tolerant to paraphrase. Ablations are informative and align with the method’s mechanisms

Weaknesses

The defense triggers only on instructions listed by IIA; the authors acknowledge unfaithful cases (10.9% bottom-left in Fig. 4) where actions are followed without listed intent. While adversarial demonstrations help, the pipeline offers no fallback when malicious execution occurs without explicit intent listing, leaving a residual risk precisely under stealth objectives. The proposed method, while practical, but sounds very trivial to me. The proposed thinking intervention is just doing the pro

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Security and Verification in Computing · Topic Modeling