PromptArmor: Simple yet Effective Prompt Injection Defenses

Tianneng Shi; Kaijie Zhu; Zhun Wang; Yuqi Jia; Will Cai; Weida Liang; Haonan Wang; Hend Alzahrani; Joshua Lu; Kenji Kawaguchi; Basel Alomair; Xuandong Zhao; William Yang Wang; Neil Gong; Wenbo Guo; Dawn Song

arXiv:2507.15219·cs.CR·July 22, 2025

PromptArmor: Simple yet Effective Prompt Injection Defenses

Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, Basel Alomair, Xuandong Zhao, William Yang Wang, Neil Gong, Wenbo Guo, Dawn Song

PDF

TL;DR

PromptArmor is a straightforward defense mechanism that prompts an LLM to detect and eliminate prompt injections, significantly reducing attack success rates and maintaining high accuracy in identifying malicious prompts.

Contribution

This paper introduces PromptArmor, a simple yet effective method for defending against prompt injection attacks by leveraging LLM prompting techniques.

Findings

01

Achieves below 1% false positive and false negative rates on the AgentDojo benchmark.

02

Reduces attack success rate to below 1% after prompt removal.

03

Effective against adaptive prompt injection attacks.

Abstract

Despite their potential, recent research has demonstrated that LLM agents are vulnerable to prompt injection attacks, where malicious prompts are injected into the agent's input, causing it to perform an attacker-specified task rather than the intended task provided by the user. In this paper, we present PromptArmor, a simple yet effective defense against prompt injection attacks. Specifically, PromptArmor prompts an off-the-shelf LLM to detect and remove potential injected prompts from the input before the agent processes it. Our results show that PromptArmor can accurately identify and remove injected prompts. For example, using GPT-4o, GPT-4.1, or o4-mini, PromptArmor achieves both a false positive rate and a false negative rate below 1% on the AgentDojo benchmark. Moreover, after removing injected prompts with PromptArmor, the attack success rate drops to below 1%. We also…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.