Segment-Level Coherence for Robust Harmful Intent Probing in LLMs

Xuanli He; Bilgehan Sel; Faizan Ali; Jenny Bao; Hoagy Cunningham; Jerry Wei

arXiv:2604.14865·cs.CL·April 17, 2026


TL;DR

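This paper proposes a streaming probing objective that requires multiple evidence tokens to consistently support a prediction, which improves detection of harmful intent in LLMs, especially in high-stakes CBRN contexts. At a fixed 1% false-positive rate, it improves the true-positive rate by 35.55% relative to strong streaming baselines.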

Contribution

It introduces a streaming probing objective that requires multiple evidence tokens to consistently support a prediction, outperforming existing streaming probes and remaining effective against adversarial obfuscation.

Findings

01

35.55% improvement in true-positive rate at a fixed 1% false-positive rate, relative to strong streaming baselines

02

AUROC improved from 97.40% to over 98.85% with the new method

03

Probing attention and MLP activations outperforms residual-stream features
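
The first finding is reported as a true-positive rate at a fixed false-positive rate. A minimal sketch of how such a metric is commonly computed from per-example scores follows; the function name `tpr_at_fpr` and the toy scores are hypothetical and are not taken from the paper, whose exact evaluation protocol is not given here.

```python
def tpr_at_fpr(benign_scores, harmful_scores, max_fpr=0.01):
    """Return the TPR at the lowest threshold whose benign FPR stays <= max_fpr.

    An example is flagged when its score is strictly above the threshold.
    This is an illustrative helper, not the paper's evaluation code.
    """
    if not benign_scores or not harmful_scores:
        raise ValueError("both score lists must be non-empty")
    benign_sorted = sorted(benign_scores, reverse=True)
    # Number of benign examples that may be flagged without exceeding max_fpr.
    # The small epsilon guards against floating-point rounding in the product.
    allowed_fp = int(max_fpr * len(benign_sorted) + 1e-9)
    if allowed_fp < len(benign_sorted):
        # At most `allowed_fp` benign scores lie strictly above this value.
        threshold = benign_sorted[allowed_fp]
    else:
        threshold = float("-inf")
    flagged = sum(1 for s in harmful_scores if s > threshold)
    return flagged / len(harmful_scores)


# Toy data: 100 benign scores spread over [0, 0.99], four harmful scores.
benign = [i / 100 for i in range(100)]
harmful = [0.5, 0.985, 0.99, 1.2]
result = tpr_at_fpr(benign, harmful, max_fpr=0.01)
```

With 100 benign examples and a 1% budget, exactly one benign example may be flagged, so the threshold sits at the second-highest benign score and three of the four toy harmful examples exceed it.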

Abstract

Large Language Models (LLMs) are increasingly exposed to adaptive jailbreaking, particularly in high-stakes Chemical, Biological, Radiological, and Nuclear (CBRN) domains. Although streaming probes enable real-time monitoring, they still make systematic errors. We identify a core issue: existing methods often rely on a few high-scoring tokens, leading to false alarms when sensitive CBRN terms appear in benign contexts. To address this, we introduce a streaming probing objective that requires multiple evidence tokens to consistently support a prediction, rather than relying on isolated spikes. This encourages more robust detection based on aggregated signals instead of single-token cues. At a fixed 1% false-positive rate, our method improves the true-positive rate by 35.55% relative to strong streaming baselines. We further observe substantial gains in AUROC, even when starting from…
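
The failure mode described in the abstract can be illustrated with a small sketch. It contrasts a max-over-tokens decision rule, which a single high-scoring token can trigger, with a top-k mean that needs several tokens to score highly. The top-k mean is an assumed stand-in chosen for illustration; it is not the paper's training objective, and the function names and per-token scores below are hypothetical.

```python
def max_score(token_scores):
    """Sequence score driven by the single highest-scoring token."""
    if not token_scores:
        raise ValueError("token_scores must be non-empty")
    return max(token_scores)


def top_k_mean_score(token_scores, k=4):
    """Sequence score as the mean of the k highest per-token scores.

    Illustrative aggregation only: it rewards sustained evidence across
    several tokens and damps isolated spikes.
    """
    if not token_scores:
        raise ValueError("token_scores must be non-empty")
    top = sorted(token_scores, reverse=True)[:k]
    return sum(top) / len(top)


# Hypothetical per-token probe scores in [0, 1].
# A benign text that mentions one sensitive term: one isolated spike.
benign_with_term = [0.05, 0.04, 0.95, 0.06, 0.03, 0.05]
# A harmful request: moderately high scores sustained over many tokens.
harmful_request = [0.10, 0.70, 0.80, 0.75, 0.85, 0.60]

max_benign = max_score(benign_with_term)
max_harmful = max_score(harmful_request)
topk_benign = top_k_mean_score(benign_with_term, k=4)
topk_harmful = top_k_mean_score(harmful_request, k=4)
```

On these toy scores the max rule ranks the benign text above the harmful one, whereas the top-k mean ranks the harmful request clearly higher, which matches the intuition behind aggregating evidence over multiple tokens.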
