Can Indirect Prompt Injection Attacks Be Detected and Removed?

Yulin Chen; Haoran Li; Yuan Sui; Yufei He; Yue Liu; Yangqiu Song; Bryan Hooi

arXiv:2502.16580·cs.CR·October 7, 2025

Can Indirect Prompt Injection Attacks Be Detected and Removed?

Yulin Chen, Haoran Li, Yuan Sui, Yufei He, Yue Liu, Yangqiu Song, Bryan Hooi

PDF

Open Access 1 Video

TL;DR

This paper explores the detection and removal of indirect prompt injection attacks on large language models, introduces a benchmark dataset, and evaluates existing and new detection and mitigation methods.

Contribution

It is the first to systematically study indirect prompt injection detection and removal, providing a benchmark dataset and evaluating multiple approaches.

Findings

01

Existing LLMs show limited detection performance on indirect attacks.

02

Training detection models on crafted datasets improves detection accuracy.

03

Segmentation and extraction methods offer promising removal strategies.

Abstract

Prompt injection attacks manipulate large language models (LLMs) by misleading them to deviate from the original input instructions and execute maliciously injected instructions, because of their instruction-following capabilities and inability to distinguish between the original input instructions and maliciously injected instructions. To defend against such attacks, recent studies have developed various detection mechanisms. If we restrict ourselves specifically to works which perform detection rather than direct defense, most of them focus on direct prompt injection attacks, while there are few works for the indirect scenario, where injected instructions are indirectly from external tools, such as a search engine. Moreover, current works mainly investigate injection detection methods and pay less attention to the post-processing method that aims to mitigate the injection after…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

Can Indirect Prompt Injection Attacks Be Detected and Removed?· underline

Taxonomy

TopicsSecurity and Verification in Computing · Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques