Segment-Level Coherence for Robust Harmful Intent Probing in LLMs
Xuanli He, Bilgehan Sel, Faizan Ali, Jenny Bao, Hoagy Cunningham, Jerry Wei

TL;DR
This paper proposes a streaming probing method that aggregates evidence across multiple tokens to detect harmful intent in LLMs, particularly in high-stakes CBRN contexts, improving the true-positive rate by 35.55% relative to strong streaming baselines at a fixed 1% false-positive rate.
Contribution
It introduces a streaming probing objective that requires multiple evidence tokens to consistently support a prediction. The resulting probe outperforms existing streaming methods and remains effective against adversarial obfuscation.
Findings
35.55% improvement in true-positive rate at a fixed 1% false-positive rate, relative to strong streaming baselines
AUROC improved from 97.40% to over 98.85% with the new method
Probing attention and MLP activations outperforms residual-stream features
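The headline metric above, true-positive rate at a fixed false-positive rate, can be computed as in the following minimal sketch. The function name `tpr_at_fpr` and the example scores are hypothetical and chosen for illustration only. The paper's own evaluation code is not shown in this summary, so this is one common way to compute the metric, not necessarily the authors' procedure.

```python
def tpr_at_fpr(benign_scores, harmful_scores, target_fpr=0.01):
    """Return (TPR, threshold) at the lowest threshold whose benign FPR <= target_fpr.

    Hypothetical helper for illustration. An example is flagged when its
    score is strictly greater than the threshold.
    """
    benign_sorted = sorted(benign_scores, reverse=True)
    # Maximum number of benign examples we may flag (rounded down, so conservative).
    allowed = int(target_fpr * len(benign_sorted))
    # Using the (allowed + 1)-th highest benign score as the threshold means
    # at most `allowed` benign examples lie strictly above it.
    if allowed < len(benign_sorted):
        threshold = benign_sorted[allowed]
    else:
        threshold = float("-inf")
    flagged = sum(1 for s in harmful_scores if s > threshold)
    return flagged / len(harmful_scores), threshold


# Made-up probe scores: 100 benign prompts and 4 harmful prompts.
benign = [i / 100 for i in range(100)]
harmful = [0.995, 0.985, 0.5, 1.2]
tpr, threshold = tpr_at_fpr(benign, harmful, target_fpr=0.01)
print(f"threshold={threshold}, TPR at 1% FPR={tpr}")
```

Fixing the false-positive budget first and then reading off the true-positive rate reflects how a deployed monitor is tuned: the tolerable rate of false alarms on benign traffic is decided up front, and detectors are compared by how much harmful content they catch within that budget.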
Abstract
Large Language Models (LLMs) are increasingly exposed to adaptive jailbreaking, particularly in high-stakes Chemical, Biological, Radiological, and Nuclear (CBRN) domains. Although streaming probes enable real-time monitoring, they still make systematic errors. We identify a core issue: existing methods often rely on a few high-scoring tokens, leading to false alarms when sensitive CBRN terms appear in benign contexts. To address this, we introduce a streaming probing objective that requires multiple evidence tokens to consistently support a prediction, rather than relying on isolated spikes. This encourages more robust detection based on aggregated signals instead of single-token cues. At a fixed 1% false-positive rate, our method improves the true-positive rate by 35.55% relative to strong streaming baselines. We further observe substantial gains in AUROC, even when starting from…
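The failure mode described in the abstract, a single high-scoring token triggering a false alarm, can be illustrated with a small sketch. The abstract does not give the exact objective, so the top-k mean below is a stand-in aggregation chosen for illustration, not the authors' method. It is also applied only when scoring, whereas the paper describes a training objective. The function names and per-token scores are hypothetical.

```python
def max_score(token_scores):
    """Sequence score driven by the single highest-scoring token."""
    return max(token_scores)


def topk_mean_score(token_scores, k=3):
    """Sequence score averaged over the k highest-scoring tokens.

    Stand-in for "multiple evidence tokens must agree"; the value of k is an
    assumption for this sketch.
    """
    top = sorted(token_scores, reverse=True)[:k]
    return sum(top) / len(top)


# Made-up per-token probe scores in [0, 1].
# Benign text that mentions one sensitive term: a single isolated spike.
benign_with_sensitive_term = [0.05, 0.10, 0.95, 0.08, 0.04]
# Harmful request: moderately high scores sustained over several tokens.
harmful_request = [0.20, 0.70, 0.80, 0.75, 0.30]

for name, scores in [("benign", benign_with_sensitive_term),
                     ("harmful", harmful_request)]:
    print(name, "max:", max_score(scores),
          "top-3 mean:", round(topk_mean_score(scores), 3))
```

In this toy example the max aggregator ranks the benign text above the harmful request because of the isolated spike, while the top-3 mean reverses the ordering because only the harmful request has several tokens that support the prediction.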
