Defending against Indirect Prompt Injection by Instruction Detection

Tongyu Wen; Chenglong Wang; Xiyuan Yang; Haoyu Tang; Yueqi Xie; Lingjuan Lyu; Zhicheng Dou; Fangzhao Wu

arXiv:2505.06311 · cs.CR · January 7, 2026

TL;DR

This paper introduces InstructDetector, a detection method that uses the behavioral states of LLMs to identify indirect prompt injection attacks, achieving high detection accuracy and a low attack success rate.

Contribution

The paper presents a novel instruction detection approach that uses intermediate-layer features of LLMs to defend against Indirect Prompt Injection (IPI) attacks, achieving state-of-the-art detection accuracy.

Findings

01. Detection accuracy of 99.60% in-domain
02. Detection accuracy of 96.90% out-of-domain
03. Reduces attack success rate to 0.03% on the BIPIA benchmark

Abstract

The integration of Large Language Models (LLMs) with external sources is becoming increasingly common, with Retrieval-Augmented Generation (RAG) being a prominent example. However, this integration exposes LLMs to Indirect Prompt Injection (IPI) attacks, where hidden instructions embedded in external data can manipulate LLMs into executing unintended or harmful actions. We recognize that IPI attacks fundamentally rely on the presence of instructions embedded within external content, which can alter the behavioral states of LLMs. Can the effective detection of such state changes help us defend against IPI attacks? In this paper, we propose InstructDetector, a novel detection-based approach that leverages the behavioral states of LLMs to identify potential IPI attacks. Specifically, we demonstrate that the hidden states and gradients from intermediate layers provide highly…
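The general idea of detecting instructions from a model's internal states can be illustrated with a toy sketch. This is not the authors' implementation: a tiny one-hidden-layer network stands in for the LLM, the inputs are synthetic vectors rather than text, and the probe loss, the synthetic shift that marks "instruction-bearing" inputs, and all names (`forward`, `extract_features`, `train_detector`, `detect`) are hypothetical choices made for illustration. It only shows the shape of the pipeline: take a hidden state and a gradient from an intermediate layer, concatenate them, and fit a small classifier on the result.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def forward(x, W1, w2):
    """Toy stand-in for an LLM: one tanh hidden layer and a scalar logit."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    logit = sum(w * hi for w, hi in zip(w2, h))
    return h, logit

def extract_features(x, W1, w2):
    """Concatenate the hidden state with the gradient of a probe loss w.r.t. it.

    Assumption: the probe loss is -log(sigmoid(logit)), i.e. the negative
    log-likelihood of a fixed target. The real method may use a different loss.
    """
    h, logit = forward(x, W1, w2)
    dloss_dlogit = sigmoid(logit) - 1.0        # derivative of -log(sigmoid(z))
    grad_h = [dloss_dlogit * w for w in w2]    # chain rule through logit = w2 . h
    return h + grad_h

def train_detector(features, labels, lr=0.5, steps=200):
    """Fit a logistic-regression detector with full-batch gradient descent."""
    w, b, n = [0.0] * len(features[0]), 0.0, len(features)
    for _ in range(steps):
        gw, gb = [0.0] * len(w), 0.0
        for f, y in zip(features, labels):
            err = sigmoid(b + sum(wi * fi for wi, fi in zip(w, f))) - y
            gw = [g + err * fi / n for g, fi in zip(gw, f)]
            gb += err / n
        w = [wi - lr * g for wi, g in zip(w, gw)]
        b -= lr * gb
    return w, b

def detect(f, w, b):
    """Return the detector's estimated probability that an instruction is present."""
    return sigmoid(b + sum(wi * fi for wi, fi in zip(w, f)))

rng = random.Random(0)
DIM, HIDDEN = 6, 4
W1 = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(HIDDEN)]  # frozen toy weights
w2 = [rng.uniform(-1, 1) for _ in range(HIDDEN)]

def sample(has_instruction):
    """Synthetic input; a shift in one coordinate plays the role of an injected instruction."""
    x = [rng.gauss(0, 1) for _ in range(DIM)]
    if has_instruction:
        x[0] += 3.0
    return x

labels = [i % 2 for i in range(200)]
inputs = [sample(y) for y in labels]
features = [extract_features(x, W1, w2) for x in inputs]
w, b = train_detector(features, labels)
scores = [detect(f, w, b) for f in features]
```

In a defense pipeline of this kind, a score above a chosen threshold would flag the external content before it reaches the model's context. The threshold and how flagged content is handled are design decisions this sketch leaves open.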

Code & Models

Repositories

MYVAE/Instruction-detection (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing