Optimizing Agent Planning for Security and Autonomy
Aashish Kolluri, Rishi Sharma, Manuel Costa, Boris Köpf, Tobias Nießen, Mark Russinovich, Shruti Tople, Santiago Zanella-Béguelin

TL;DR
This paper presents a security-aware agent design that increases autonomy by reducing the need for human oversight while preserving security. It relies on system-level defenses against prompt injection attacks and is evaluated on benchmark tasks.
Contribution
It introduces autonomy metrics and a planning approach that balances task progress with policy compliance, improving autonomous execution under security constraints.
Findings
Higher autonomy achieved without utility loss
Effective defense against prompt injection attacks
Improved planning for security and task completion
Abstract
Indirect prompt injection attacks threaten AI agents that execute consequential actions, motivating deterministic system-level defenses. Such defenses can provably block unsafe actions by enforcing confidentiality and integrity policies, but currently appear costly: they reduce task completion rates and increase token usage compared to probabilistic defenses. We argue that existing evaluations miss a key benefit of system-level defenses: reduced reliance on human oversight. We introduce autonomy metrics to quantify this benefit: the fraction of consequential actions an agent can execute without human-in-the-loop (HITL) approval while preserving security. To increase autonomy, we design a security-aware agent that (i) introduces richer HITL interactions, and (ii) explicitly plans for both task progress and policy compliance. We implement this agent design atop an existing…
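The autonomy metric described in the abstract can be illustrated with a minimal sketch. The abstract only states that autonomy is the fraction of consequential actions executed without HITL approval; the `Action` record, its field names, the `autonomy` function, and the convention of returning 1.0 for a trace with no consequential actions are all assumptions made here for illustration, not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical record of one agent step (not from the paper)."""
    name: str
    consequential: bool  # the action has side effects, e.g. sending an email
    needed_hitl: bool    # a human had to approve it before execution


def autonomy(trace):
    """Fraction of consequential actions executed without HITL approval.

    Assumption: a trace with no consequential actions counts as fully
    autonomous (1.0); the paper may define this case differently.
    """
    consequential = [a for a in trace if a.consequential]
    if not consequential:
        return 1.0
    autonomous = sum(1 for a in consequential if not a.needed_hitl)
    return autonomous / len(consequential)


# Illustrative trace: four consequential actions, two of which needed approval.
trace = [
    Action("read_inbox", consequential=False, needed_hitl=False),
    Action("send_email", consequential=True, needed_hitl=False),
    Action("transfer_funds", consequential=True, needed_hitl=True),
    Action("create_event", consequential=True, needed_hitl=False),
    Action("delete_file", consequential=True, needed_hitl=True),
]
print(autonomy(trace))  # 0.5
```

Under this reading, an agent that completes the same tasks while asking for fewer approvals scores higher on autonomy even when its task completion rate is unchanged.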
Peer Reviews
Decision·ICLR 2026 Poster
1. The research objective is interesting and bridges the gap between agents in research and agents in the real world by taking human confirmation into account. 2. The paper is well organized, especially the evaluation section; the findings and corresponding empirical evidence are clear. The models are also essentially state-of-the-art, which makes the results more convincing.
1. It is not clear why the paper claims the agent can achieve a security "guarantee". If I understand correctly, there may still be policy violations even with policy-aware planning. 2. The paper proposes metrics motivated by real agents, but no evaluations on these agents are provided. 3. One key insight is that task completion rate is not a good metric for real-world agents with human interactions, but no qualitative or quantitative analysis directly shows how bad the task completion rate cou
1. The focused problem and proposed metrics are realistic. In the experience of using LLM-based AI agents, HITL approvals are disruptive and slow down the task completion process, yet they are necessary for security. 2. Policy and security label awareness encourage the agent to use safe and trusted tools and data. With this feature integrated into planning, the AI agent can reduce vulnerability at the root. This both reduces HITL interventions and avoids malicious actions. 3. The dual LLM design
1. Evaluation. Both AgentDojo and WASP (based on VisualWebArena) contain different subtasks or websites. Previous work like CaMeL [1] also shows results on the subtasks. I suggest the authors include this analysis, since different tasks may require different steps and HITL interventions fundamentally, and averaging them out might not be reasonable. 2. Policy-aware planning needs additional analysis on the total number of steps and the time required to complete the tasks. The agents will try not
1. This paper presents the agent design of PRUDENTIA and a deterministic system-level defense, which is well motivated and aims to address a critical issue in agent security research. 2. To quantify the autonomy benefits and evaluate the proposed approach, this paper introduces new autonomy metrics and conducts extensive experiments. 3. This paper has good real-world applications, e.g., reducing security risks, improving user trust, and safer integration with external data for AI agents.
1. This paper focuses only on indirect prompt injection attacks; the generalization of the proposed approach to other types of attacks is not discussed. 2. While the proposed approach outperforms baselines in autonomy, its completion rate shows some regression (73.2% vs. FIDES’ 75.7%). This might suggest a minor trade-off between autonomy and performance, but this is not discussed.
Taxonomy
Topics: Security and Verification in Computing · Adversarial Robustness in Machine Learning · Information and Cyber Security
