CoT-Guard: Small Models for Strong Monitoring
Nirav Diwan, Han Wang, Berkcan Kapusuzoglu, Ramin Moradi, Supriyo Chakraborty, Giri Iyengar, Sambit Sahu, Huan Zhang, Gang Wang

TL;DR
This paper introduces CoT-Guard, a small 4B-parameter model that effectively monitors reasoning traces to detect covert misbehavior in code generation, offering a practical, cost-effective alternative to large models.
Contribution
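The authors develop a post-training pipeline that combines supervised fine-tuning (SFT) and reinforcement learning (RL) to improve small models' ability to detect hidden objectives. SFT distills detection behavior from stronger monitors, while RL on hard, subtly crafted hidden objectives targets generalization to out-of-domain cases. The resulting 4B model outperforms several larger models.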
Findings
CoT-Guard achieves a G-mean of 75%, surpassing several large models.
It demonstrates strong generalization under prompt and code manipulation attacks.
It outperforms GPT-5.4, GPT-5-mini, and Qwen3-32B in detection performance.
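The G-mean figure above is easier to interpret with a small worked example. The sketch below assumes G-mean has its usual meaning for binary detectors, the geometric mean of the true positive rate and the true negative rate; this summary does not state the paper's exact definition, so treat that as an assumption. The function name `g_mean` and the toy labels are hypothetical and are not taken from the paper.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of TPR and TNR for binary labels.

    Assumed convention: 1 = hidden objective present (or flagged),
    0 = benign. This definition is an assumption, not the paper's.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)

    # Guard against empty classes so the function never divides by zero.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)


# Toy data: 4 traces with a hidden objective, 4 benign traces.
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# A monitor that catches 3 of 4 attacks and raises 1 false alarm.
balanced_monitor = [1, 1, 1, 0, 0, 0, 0, 1]

# A monitor that flags every trace: perfect recall, no specificity.
flag_everything = [1] * 8

print(g_mean(labels, balanced_monitor))
print(g_mean(labels, flag_everything))
```

Under this definition, a monitor that flags every trace scores zero despite catching every attack, which is why G-mean is a stricter summary than recall alone when false alarms carry a cost.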
Abstract
Monitoring the chain-of-thought (CoT) of reasoning models is a promising approach for detecting covert misbehavior (i.e., hidden objectives) in code generation tasks. While large models (GPT-5, Gemini-3-Flash) can serve as effective CoT monitors, they are expensive to deploy due to the lengthy reasoning traces and high API cost, emphasizing the need for smaller, cheaper alternatives. Nevertheless, we find that current small models (4B–8B) struggle to detect hidden objectives despite access to the CoT, frequently misattributing them as part of the user query. To address this, we propose a post-training pipeline combining supervised fine-tuning (SFT) and reinforcement learning (RL), where SFT narrows the gap for in-domain tasks by distilling detection behavior from stronger monitors, and RL on hard and subtly crafted hidden objectives helps the model generalize to out-of-domain…
