TL;DR
Behavior Cue Reasoning trains LLMs to emit special token sequences immediately before specific behaviors, making their reasoning easier to monitor and control and reducing unsafe actions without sacrificing performance.
Contribution
We propose Behavior Cues, a novel training method that makes LLM reasoning more monitorable and controllable, improving safety and efficiency during complex reasoning tasks.
Findings
Behavior Cues let a monitor prune up to 50% of reasoning tokens in complex math problem solving.
They recover safe actions in 80% of unsafe reasoning traces.
No performance cost was observed across multiple models and domains.
Abstract
Reasoning in Large Language Models (LLMs) poses a challenge for oversight, as many misaligned behaviors do not surface until reasoning concludes. To address this, we introduce Behavior Cue Reasoning to make LLM reasoning more controllable and monitorable. Behavior Cues are special token sequences that a model is trained to emit immediately before specific implicit and explicit behaviors, acting as dual-purpose signals and control levers. When fine-tuning a weaker external monitor with Reinforcement Learning for reasoning oversight, a compressed view containing only the information surfaced by Behavior Cues is sufficient signal for the monitor to prune up to 50% of otherwise wasted reasoning tokens in complex math problem solving. When leveraged by an almost optimal rule-based monitor in an environment where excessive constraint violations result in failure, Behavior Cues allow for the recovery…
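To make the idea of cues as "signal and control levers" concrete, here is a minimal sketch of a rule-based monitor that scans a reasoning trace for cue tokens and picks the point at which to interrupt. It is not the paper's implementation: the `<cue:name>` token format, the cue name `violate_constraint`, the violation budget, and the helper names `extract_cues` and `first_cut_point` are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Set

# ASSUMPTION: cues look like "<cue:some_name>". The real token sequences
# are not given in the text above; this format is hypothetical.
CUE_PATTERN = re.compile(r"<cue:([a-z_]+)>")


@dataclass
class CueEvent:
    name: str      # cue label, e.g. "violate_constraint" (hypothetical)
    position: int  # character offset where the cue starts in the trace


def extract_cues(trace: str) -> List[CueEvent]:
    """Return the compressed view of a trace: only the cues, in order."""
    return [CueEvent(m.group(1), m.start()) for m in CUE_PATTERN.finditer(trace)]


def first_cut_point(trace: str, flagged: Set[str], budget: int) -> Optional[int]:
    """Return the offset of the flagged cue that would exceed the budget.

    Because a cue is emitted immediately before the behavior it announces,
    truncating the trace at this offset stops generation before that
    behavior occurs. Returns None if the budget is never exceeded.
    """
    seen = 0
    for event in extract_cues(trace):
        if event.name in flagged:
            seen += 1
            if seen > budget:
                return event.position
    return None
```

A caller could truncate the trace at the returned offset and either stop or resample from there. The monitor only ever reads the cue events, which mirrors the "compressed view" described in the abstract, although a learned monitor would replace the simple counting rule used here.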
