TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense
Cheng Liu, Xiaolei Liu, Xingyu Li, Bangzhou Xin, Kangyi Ding

TL;DR
TrajGuard is a real-time, decoding-time defense framework that detects jailbreak attempts by analyzing hidden-state trajectories, achieving high defense rates with minimal latency without modifying the language model.
Contribution
TrajGuard introduces a training-free method that uses hidden-state trajectories during decoding to detect jailbreaks in real time.
Findings
Achieves an average defense rate of 95% across 12 jailbreak attacks.
Reduces detection latency to 5.2 ms per token.
Maintains false positive rate below 1.5%.
Abstract
Existing jailbreak defense paradigms primarily rely on static detection of prompts, outputs, or internal states, often neglecting the dynamic evolution of risk during decoding. This oversight leaves risk signals embedded in decoding trajectories underutilized, constituting a critical blind spot in current defense systems. In this work, we empirically demonstrate that hidden states in critical layers during the decoding phase carry stronger and more stable risk signals than input jailbreak prompts. Specifically, the hidden representations of tokens generated during jailbreak attempts progressively approach high-risk regions in the latent space. Based on this observation, we propose TrajGuard, a training-free, decoding-time defense framework. TrajGuard aggregates hidden-state trajectories via a sliding window to quantify risk in real time, triggering a lightweight semantic adjudication…
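The sliding-window aggregation mentioned in the abstract might look roughly like the sketch below. The abstract does not say how per-token risk is scored, so everything here is an assumption for illustration: the `risk_direction` vector (a stand-in for a "high-risk region" in latent space), the cosine-similarity score, the window size, and the threshold are all hypothetical and are not taken from the paper.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


class SlidingWindowRiskMonitor:
    """Hypothetical sketch, not the paper's method.

    Scores each decoded token's hidden state against an assumed risk
    direction and averages the scores over the last `window` tokens.
    """

    def __init__(self, risk_direction, window=8, threshold=0.5):
        self.risk_direction = list(risk_direction)  # assumed, e.g. estimated offline
        self.scores = deque(maxlen=window)          # keeps only the latest scores
        self.threshold = threshold                  # assumed trigger level

    def update(self, hidden_state):
        """Add one token's hidden state; return (window_risk, flagged)."""
        self.scores.append(cosine(hidden_state, self.risk_direction))
        window_risk = sum(self.scores) / len(self.scores)
        return window_risk, window_risk > self.threshold


if __name__ == "__main__":
    # Toy 2-D trajectory that drifts toward the assumed risk direction.
    monitor = SlidingWindowRiskMonitor([1.0, 0.0], window=2, threshold=0.5)
    for step, h in enumerate([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]):
        risk, flagged = monitor.update(h)
        print(step, round(risk, 3), flagged)
```

In a real decoder, `hidden_state` would come from a chosen layer at each generation step, and a flagged window would hand off to whatever second-stage check the system uses. Averaging over a window rather than thresholding single tokens is one simple way to avoid reacting to an isolated outlier.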
