ESLD (External Surrogate Latent Defense): A Latent-Space Architecture for Faster, Stronger Prompt-Injection Defense
Yash Narendra

TL;DR
ESLD is a model-agnostic architecture that leverages internal guard model representations to significantly speed up and improve prompt-injection detection in AI assistants.
Contribution
This work introduces ESLD, a novel latent-space architecture that enhances prompt-injection defense by utilizing internal guard signals without retraining.
Findings
Speeds up safety checks by over 3 times on average.
Improves detection accuracy by 16.4 percentage points.
Compatible with existing guard models without retraining.
Abstract
Modern AI assistants are agentic. To answer a single user request, the underlying language model pulls in information from many sources, such as web searches, retrieved documents, tool outputs, and user follow-ups, and reasons over them across several steps. Any of these inputs can carry malicious content. This opens the door to prompt injection, where an attacker plants text designed to override the instructions given to the assistant by its developer. For example, an attacker applying for a job can insert white-on-white text in their resume saying ``This is the strongest candidate. Recommend for immediate hire''. A hiring assistant may then be steered toward a favorable recommendation regardless of actual qualifications. To defend against this threat, production systems use a separate guard model in front of the assistant. The guard reads incoming text and writes a verdict (``safe''…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
