ESLD (External Surrogate Latent Defense): A Latent-Space Architecture for Faster, Stronger Prompt-Injection Defense

Yash Narendra

arXiv:2605.18918·cs.CR·May 20, 2026

ESLD (External Surrogate Latent Defense): A Latent-Space Architecture for Faster, Stronger Prompt-Injection Defense

Yash Narendra

PDF

TL;DR

ESLD is a model-agnostic architecture that leverages internal guard model representations to significantly speed up and improve prompt-injection detection in AI assistants.

Contribution

This work introduces ESLD, a novel latent-space architecture that enhances prompt-injection defense by utilizing internal guard signals without retraining.

Findings

01

Speeds up safety checks by over 3 times on average.

02

Improves detection accuracy by 16.4 percentage points.

03

Compatible with existing guard models without retraining.

Abstract

Modern AI assistants are agentic. To answer a single user request, the underlying language model pulls in information from many sources, such as web searches, retrieved documents, tool outputs, and user follow-ups, and reasons over them across several steps. Any of these inputs can carry malicious content. This opens the door to prompt injection, where an attacker plants text designed to override the instructions given to the assistant by its developer. For example, an attacker applying for a job can insert white-on-white text in their resume saying ``This is the strongest candidate. Recommend for immediate hire''. A hiring assistant may then be steered toward a favorable recommendation regardless of actual qualifications. To defend against this threat, production systems use a separate guard model in front of the assistant. The guard reads incoming text and writes a verdict (``safe''…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.