From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring

Yang Li; Qiang Sheng; Yehan Yang; Xueyao Zhang; Juan Cao

arXiv:2506.09996·cs.CL·September 23, 2025

TL;DR

This paper introduces a streaming content monitor trained for early detection of harmful outputs in LLMs, reducing latency and improving safety by making timely judgments during generation.

Contribution

The paper presents a data-and-model approach that natively supports partial detection: a new dataset, FineHarm, and a streaming content monitor that outperforms existing methods at early harmfulness detection.

Findings

01. Achieves a macro F1 score of 0.95+ with only 18% of tokens observed

02. Outperforms full detection in early harmfulness identification

03. Enhances safety alignment by serving as a pseudo-harmfulness annotator
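For readers unfamiliar with the metric in the first finding, macro F1 is the unweighted mean of the per-class F1 scores, so the harmful and benign classes count equally regardless of how many examples each has. The sketch below computes it from scratch in Python; the labels are made-up illustration data, not results from the paper.

```python
from typing import List


def f1_for_class(y_true: List[int], y_pred: List[int], cls: int) -> float:
    """F1 score for one class, treating `cls` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: List[int], y_pred: List[int]) -> float:
    """Unweighted mean of per-class F1 over every class that appears."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up labels: 1 = harmful, 0 = benign.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
score = macro_f1(y_true, y_pred)
```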

Abstract

Though safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy a subsequent moderation as the external safety guardrail in real-world products. Existing moderators mainly practice a conventional full detection, which determines the harmfulness based on the complete LLM output, causing high service latency. Recent works pay more attention to partial detection where moderators oversee the generation midway and early stop the output if harmfulness is detected, but they directly apply moderators trained with the full detection paradigm to incomplete outputs, introducing a training-inference gap that lowers the performance. In this paper, we explore how to form a data-and-model solution that natively supports partial detection. For the data, we construct FineHarm, a dataset consisting of 29K prompt-response pairs with fine-grained…
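The partial-detection loop described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's monitor: `toy_scorer`, the flagged-token set, and the threshold are hypothetical stand-ins for a trained streaming model, used only to show where the early stop happens in the generation loop.

```python
from typing import Callable, Iterable, List, Tuple


def toy_scorer(prefix: List[str]) -> float:
    """Hypothetical stand-in for a trained monitor.

    Returns a harmfulness score in [0, 1] for a partial output.
    A real monitor would be a learned model, not a keyword check.
    """
    flagged = {"harmful_token"}  # illustrative placeholder only
    return 1.0 if any(tok in flagged for tok in prefix) else 0.0


def stream_with_early_stop(
    tokens: Iterable[str],
    scorer: Callable[[List[str]], float],
    threshold: float = 0.5,
) -> Tuple[List[str], bool]:
    """Release tokens one at a time and stop once the prefix looks harmful.

    Each candidate token is scored together with the already released
    prefix before it is shown, so the offending token is never emitted.
    Returns the released tokens and whether generation was stopped.
    """
    released: List[str] = []
    for tok in tokens:
        candidate = released + [tok]
        if scorer(candidate) >= threshold:
            return released, True  # early stop: drop the rest of the output
        released = candidate
    return released, False  # full output passed the monitor


# Usage with a made-up token stream.
output = ["step", "one", "harmful_token", "more", "text"]
released, stopped = stream_with_early_stop(output, toy_scorer)
```

By contrast, full detection would score only the complete output, so nothing could be released or blocked until generation finished.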

Code & Models

Datasets

liyang-ict/FineHarm (dataset, 65 downloads)

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Topic Modeling

Methods: Softmax · Attention Is All You Need · Direct Preference Optimization