Before the Last Token: Diagnosing Final-Token Safety Probe Failures
Shravan Doda

TL;DR
This paper investigates why final-token safety probes fail to detect unsafe content in language model prompts: unsafe evidence can be distributed across earlier tokens and missed by the final-token readout. It proposes trajectory-based diagnostics to address this.
Contribution
It identifies a failure mode of final-token safety probes, analyzes the distribution of unsafe evidence, and introduces trajectory-aware diagnostics to improve safety assessment.
Findings
Final-token probes miss jailbreak prompts that distribute unsafe evidence earlier in sequences.
Increasing probe bottleneck width does not reliably fix representational mismatches.
Trajectory models can recover missed unsafe content without false positives.
Abstract
Final-token safety probes monitor a single hidden state after prompt prefill, but jailbreak prompts can contain probe-visible unsafe evidence distributed across earlier user-token representations that is missed by this readout. We study this prefill-time failure mode using SafeSwitch-style probes trained only on clean harmful and benign prompts across three instruction-tuned LLMs. The probes achieve high recall on clean harmful prompts, but miss many jailbreaks and can produce false positives on safety-adjacent benign prompts. Subspace analyses suggest that missed jailbreaks differ from clean benign prompts along directions that are poorly captured by the probe's representational subspace, and increasing probe bottleneck width does not reliably resolve this mismatch. Token-level prefill analyses reveal that probe-visible unsafe evidence often appears earlier in the sequence but is not…
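To make the contrast between a final-token readout and a trajectory-aware readout concrete, here is a minimal sketch on synthetic data. It assumes a single linear probe (weight `w`, bias `b`) applied to every prefill position and uses max-pooling over positions as a stand-in for a trajectory model. The linear probe, the pooling rule, and all names and values below are illustrative assumptions, not the paper's SafeSwitch-style probes or its trajectory models.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def token_scores(hidden_states, w, b):
    """Per-token probe probabilities for a (T, d) array of hidden states."""
    return sigmoid(hidden_states @ w + b)

def final_token_score(hidden_states, w, b):
    """Readout from the last prefill position only."""
    return float(token_scores(hidden_states, w, b)[-1])

def trajectory_max_score(hidden_states, w, b):
    """Hypothetical trajectory readout: max probe score over all positions."""
    return float(token_scores(hidden_states, w, b).max())

# Synthetic example (all values are made up for illustration):
# the probe reads dimension 0, and only token 2 carries unsafe evidence.
d = 4
w = np.array([1.0, 0.0, 0.0, 0.0])
b = -2.0
H = np.zeros((6, d))   # 6 prefill tokens, d-dimensional hidden states
H[2, 0] = 5.0          # unsafe evidence appears early in the sequence

final_score = final_token_score(H, w, b)    # final token carries no evidence
traj_score = trajectory_max_score(H, w, b)  # picks up the early token

print(f"final-token score: {final_score:.3f}")
print(f"trajectory score:  {traj_score:.3f}")
```

In this toy setup a 0.5 threshold on the final-token score would miss the prompt, while the same threshold on the max-pooled score would flag it. Max-pooling is only one possible aggregation. It can also raise false positives on benign prompts that mention safety-adjacent terms, so any real threshold or pooling rule would need to be validated on held-out data.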
