Predict, Don't React: Value-Based Safety Forecasting for LLM Streaming
Pride Kavumba, Koki Wataoka, Huy H. Nguyen, Jiaxuan Li, Masaya Ohagi

TL;DR
This paper introduces StreamGuard, a forecasting-based streaming safety guardrail for LLMs that predicts future harmfulness to enable early intervention without needing exact boundary annotations.
Contribution
It proposes a unified, model-agnostic approach to streaming moderation using forecasting supervised by Monte Carlo rollouts, improving safety performance across benchmarks.
Findings
StreamGuard improves input moderation F1 from 86.7 to 88.2.
StreamGuard achieves 97.5 F1 on response moderation.
Forecasting supervision transfers effectively across models and tokenizers.
Abstract
In many practical LLM deployments, a single guardrail is used for both prompt and response moderation. Prompt moderation operates on fully observed text, whereas streaming response moderation requires safety decisions to be made over partial generations. Existing text-based streaming guardrails commonly frame this output-side problem as boundary detection, training models to identify the earliest prefix at which a response has already become unsafe. In this work, we introduce StreamGuard, a unified model-agnostic streaming guardrail that instead formulates moderation as a forecasting problem: given a partial prefix, the model predicts the expected harmfulness of likely future continuations. We supervise this prediction using Monte Carlo rollouts, which enables early intervention without requiring exact token-level boundary annotations. Across standard safety benchmarks, StreamGuard…
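The forecasting formulation in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it only shows one plausible way a Monte Carlo rollout target for a prefix could be computed, and how a per-prefix value might drive early intervention. All names (`rollout_value`, `first_intervention_index`, `toy_sampler`, `toy_harm`) are hypothetical, and scoring each completed rollout with a harm function, as well as stopping at a fixed threshold, are assumptions made for illustration. In a StreamGuard-style setup, a trained guardrail would predict this value directly instead of running rollouts at inference time.

```python
import random
from typing import Callable, List, Optional

Tokens = List[str]


def rollout_value(
    prefix: Tokens,
    sample_continuation: Callable[[Tokens, random.Random], Tokens],
    harm_score: Callable[[Tokens], float],
    num_rollouts: int = 8,
    rng: Optional[random.Random] = None,
) -> float:
    """Monte Carlo estimate of the expected harmfulness of a prefix.

    Samples `num_rollouts` continuations, scores each completed text,
    and returns the mean score (assumed here to lie in [0, 1]).
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    total = 0.0
    for _ in range(num_rollouts):
        continuation = sample_continuation(prefix, rng)
        total += harm_score(prefix + continuation)
    return total / num_rollouts


def first_intervention_index(
    tokens: Tokens,
    value_fn: Callable[[Tokens], float],
    threshold: float = 0.5,
) -> Optional[int]:
    """Return the length of the shortest prefix whose forecast meets the threshold.

    Returns None if no prefix of the stream crosses the threshold.
    """
    for t in range(1, len(tokens) + 1):
        if value_fn(tokens[:t]) >= threshold:
            return t
    return None


# Toy stand-ins for a generator and a harm classifier (assumptions only).
def toy_sampler(prefix: Tokens, rng: random.Random) -> Tokens:
    # Prefixes mentioning "bypass" usually continue harmfully in this toy world.
    if "bypass" in prefix:
        return ["UNSAFE"] if rng.random() < 0.75 else ["ok"]
    return ["ok"]


def toy_harm(text: Tokens) -> float:
    return 1.0 if "UNSAFE" in text else 0.0


if __name__ == "__main__":
    stream = ["how", "to", "bypass", "a", "filter"]
    forecast = lambda p: rollout_value(p, toy_sampler, toy_harm, num_rollouts=200)
    print(first_intervention_index(stream, forecast))
```

In this toy setting the stream is flagged as soon as the prefix makes harmful continuations likely, before any harmful token has actually been emitted, which is the contrast with boundary detection that the abstract draws.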
