Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection

Prashant Kulkarni

arXiv:2604.28129·cs.CR·May 1, 2026

TL;DR

This paper introduces adversarial restlessness, an activation-level signature for detecting multi-turn prompt injection attacks in large language models. Trajectory features built on this signature lift conversation-level detection from 76.2% to 93.8% on synthetic held-out data.

Contribution

It shows that activation trajectory features can identify covert multi-turn attacks, that the resulting probes are model-specific, and that generalization depends on which data sources are used for training.

Findings

01

Five scalar trajectory features lift conversation-level detection from 76.2% to 93.8% on synthetic held-out data.

02

Probes are model-specific and do not transfer across architectures.

03

Combined multi-source training achieves 89.4% detection with low false positives.

Abstract

Multi-turn prompt injection follows a known attack path -- trust-building, pivoting, escalation -- but text-level defenses miss covert attacks where individual turns appear benign. We show this attack path leaves an activation-level signature in the model's residual stream: each phase shift moves the activation, producing a total path length far exceeding that of benign conversations. We call this adversarial restlessness. Five scalar trajectory features capturing this signal lift conversation-level detection from 76.2% to 93.8% on synthetic held-out data. The signal replicates across four model families (24B-70B); probes are model-specific and do not transfer across architectures. Generalization is source-dependent: leave-one-source-out evaluation shows each of synthetic, LMSYS-Chat-1M, and SafeDialBench captures distinct attack distributions, with detection on real-world LMSYS reaching 47-71%…
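
The "total path length" idea can be illustrated with a minimal sketch. It assumes one activation vector per conversation turn (for example, a residual-stream state read at a chosen layer), which is an assumption here: this page does not state the layer, the pooling, or the exact five features. The names `path_length` and `net_displacement` are hypothetical, and the 2-D vectors are toy values, not real activations.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def path_length(trajectory: Sequence[Vector]) -> float:
    """Sum of Euclidean distances between consecutive per-turn vectors."""
    return sum(math.dist(a, b) for a, b in zip(trajectory, trajectory[1:]))


def net_displacement(trajectory: Sequence[Vector]) -> float:
    """Straight-line distance from the first turn to the last turn."""
    if len(trajectory) < 2:
        return 0.0
    return math.dist(trajectory[0], trajectory[-1])


# Toy 2-D stand-ins for per-turn activations (illustrative values only).
benign = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0)]  # slow drift
attack = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0), (3.0, 4.0)]  # repeated jumps

benign_length = path_length(benign)
attack_length = path_length(attack)
```

In this toy case the conversation with repeated jumps accumulates a much longer path than the slowly drifting one, which is the kind of contrast the abstract describes. A real detector would feed such scalars into a trained, model-specific probe rather than compare them directly.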
