MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents

Kaijie Zhu; Xianjun Yang; Jindong Wang; Wenbo Guo; William Yang Wang

arXiv:2502.05174·cs.CR·June 12, 2025

MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents

Kaijie Zhu, Xianjun Yang, Jindong Wang, Wenbo Guo, William Yang Wang

PDF

Open Access 1 Repo

TL;DR

MELON is a novel defense mechanism against indirect prompt injection attacks in AI agents, using re-execution with masked prompts to detect malicious behavior while maintaining utility.

Contribution

We introduce MELON, a new provable defense against IPI attacks that detects malicious actions by comparing original and masked execution trajectories, outperforming existing methods.

Findings

01

MELON outperforms state-of-the-art defenses in attack prevention.

02

MELON maintains higher utility compared to existing defenses.

03

Combining MELON with prompt augmentation further enhances security.

Abstract

Recent research has explored that LLM agents are vulnerable to indirect prompt injection (IPI) attacks, where malicious tasks embedded in tool-retrieved information can redirect the agent to take unauthorized actions. Existing defenses against IPI have significant limitations: either require essential model training resources, lack effectiveness against sophisticated attacks, or harm the normal utilities. We present MELON (Masked re-Execution and TooL comparisON), a novel IPI defense. Our approach builds on the observation that under a successful attack, the agent's next action becomes less dependent on user tasks and more on malicious tasks. Following this, we design MELON to detect attacks by re-executing the agent's trajectory with a masked user prompt modified through a masking function. We identify an attack if the actions generated in the original and masked executions are…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

kaijiezhu11/melon
pytorchOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSecurity and Verification in Computing · Physical Unclonable Functions (PUFs) and Hardware Security · Radiation Effects in Electronics