Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models
Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis

TL;DR
This paper investigates how internal representations of large language models differ during jailbreak and benign prompts, proposing a detection method based on latent space analysis that effectively blocks jailbreak attempts at inference time.
Contribution
It introduces a tensor-based latent representation framework for jailbreak detection that operates without fine-tuning and demonstrates its effectiveness across multiple open-source models.
Findings
Identified consistent latent-space patterns associated with harmful inputs.
Successfully blocked 78% of jailbreak attempts on LLaMA-3.1-8B.
Achieved 94% preservation of benign behavior while detecting jailbreaks.
Abstract
Jailbreaking large language models (LLMs) has emerged as a critical security challenge with the widespread deployment of conversational AI systems. Adversarial users exploit these models through carefully crafted prompts to elicit restricted or unsafe outputs, a phenomenon commonly referred to as Jailbreaking. Despite numerous proposed defense mechanisms, attackers continue to develop adaptive prompting strategies, and existing models remain vulnerable. This motivates approaches that examine the internal behavior of LLMs rather than relying solely on prompt-level defenses. In this work, we study jailbreaking from both security and interpretability perspectives by analyzing how internal representations differ between jailbreak and benign prompts. We conduct a systematic layer-wise analysis across multiple open-source models, including GPT-J, LLaMA, Mistral, and the state-space model…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Advanced Malware Detection Techniques
