PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification
Guangyu Gong, Zizhuang Deng

TL;DR
PlanGuard is a training-free framework that enhances LLM agent security against indirect prompt injection by using planning and hierarchical verification to ensure behavior aligns with user instructions.
Contribution
It introduces a novel planning-based consistency verification method that effectively defends against IPI attacks without retraining the model.
Findings
PlanGuard reduces the attack success rate from 72.8% to 0%.
It maintains a low false positive rate of 1.49%.
The method is model-agnostic and compatible with various systems.
Abstract
Large Language Model (LLM) agents are increasingly integrated into critical systems, leveraging external tools to interact with the real world. However, this capability exposes them to Indirect Prompt Injection (IPI), where attackers embed malicious instructions into retrieved content to manipulate the agent into executing unauthorized or unintended actions. Existing defenses predominantly focus on the pre-processing stage, neglecting the monitoring of the model's actual behavior. In this paper, we propose PlanGuard, a training-free defense framework based on the principle of Context Isolation. Unlike prior methods, PlanGuard introduces an isolated Planner that generates a reference set of valid actions derived solely from user instructions. In addition, we design a Hierarchical Verification Mechanism that first enforces strict hard constraints to block unauthorized tool invocations,…
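The hard-constraint stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `ToolCall` type, the tool names, and the keyword-based `plan_allowed_tools` function are all hypothetical, and the keyword lookup merely stands in for an isolated Planner, which would presumably be an LLM call that sees only the user instruction. Only the first verification stage is shown, because the abstract is truncated before it describes the later stages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """A tool invocation proposed by the agent (hypothetical type)."""
    tool: str
    args: tuple = ()


def plan_allowed_tools(user_instruction: str) -> set[str]:
    """Hypothetical stand-in for the isolated Planner.

    It sees ONLY the user instruction, never retrieved content, so injected
    text cannot influence the reference set. A toy keyword table replaces
    what would presumably be an LLM call.
    """
    keyword_to_tool = {
        "email": "read_email",
        "summarize": "summarize_text",
        "transfer": "send_money",
    }
    text = user_instruction.lower()
    return {tool for keyword, tool in keyword_to_tool.items() if keyword in text}


def passes_hard_constraint(call: ToolCall, allowed: set[str]) -> bool:
    """Hard constraint: the invoked tool must be in the reference set."""
    return call.tool in allowed


user_instruction = "Summarize my latest email."
allowed = plan_allowed_tools(user_instruction)

# Calls the agent proposes after reading an email that contains an injected
# instruction to send money (illustrative data, not from the paper).
proposed = [
    ToolCall("read_email"),
    ToolCall("summarize_text"),
    ToolCall("send_money", ("attacker", 1000)),
]

blocked = [call.tool for call in proposed if not passes_hard_constraint(call, allowed)]
print(blocked)  # ['send_money']
```

The point of the design is that the reference set is computed before, and independently of, any retrieved content. An injected instruction can change what the agent proposes, but it cannot change what the check permits.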
