CoT-Guard: Small Models for Strong Monitoring

Nirav Diwan; Han Wang; Berkcan Kapusuzoglu; Ramin Moradi; Supriyo Chakraborty; Giri Iyengar; Sambit Sahu; Huan Zhang; Gang Wang

arXiv:2605.12746·cs.CR·May 14, 2026

TL;DR

This paper introduces CoT-Guard, a small 4B-parameter model that effectively monitors reasoning traces to detect covert misbehavior in code generation, offering a practical, cost-effective alternative to large models.

Contribution

The authors develop a novel training pipeline combining supervised fine-tuning and reinforcement learning to enhance small models' ability to detect hidden objectives, outperforming larger models in certain scenarios.

Findings

01

CoT-Guard achieves a G-mean of 75%, surpassing several large models.

02

It demonstrates strong generalization under prompt and code manipulation attacks.

03

CoT-Guard outperforms GPT-5.4, GPT-5-mini, and Qwen3-32B in detection performance.
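The findings above are stated in terms of G-mean, which this page does not define. A minimal sketch, assuming the standard definition (the geometric mean of the true positive rate and the true negative rate) and assuming that is the definition the paper uses; the function name `g_mean` and the confusion-matrix counts below are hypothetical illustrations, not results from the paper.

```python
import math


def g_mean(tp: int, fn: int, tn: int, fp: int) -> float:
    """Geometric mean of TPR and TNR for a binary monitor.

    Assumed standard definition; the paper may define it differently.
    tp/fn count traces WITH a hidden objective (flagged / missed),
    tn/fp count benign traces (passed / wrongly flagged).
    """
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # detection rate
    tnr = tn / (tn + fp) if (tn + fp) else 0.0  # benign pass rate
    return math.sqrt(tpr * tnr)


# Hypothetical counts, chosen only to illustrate the arithmetic.
balanced = g_mean(tp=900, fn=100, tn=625, fp=375)  # TPR 0.9, TNR 0.625

# A monitor that flags every trace has perfect recall but TNR 0,
# so its G-mean collapses to zero.
flag_everything = g_mean(tp=1000, fn=0, tn=0, fp=1000)

print(round(balanced, 3), flag_everything)
```

Under this definition a monitor cannot score well by flagging everything or nothing, which is why the metric suits detection tasks where both missed hidden objectives and false alarms on benign traces are costly.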

Abstract

Monitoring the chain-of-thought (CoT) of reasoning models is a promising approach for detecting covert misbehavior (i.e., hidden objectives) in code generation tasks. While large models (GPT-5, Gemini-3-Flash) can serve as effective CoT monitors, they are expensive to deploy due to the lengthy reasoning traces and high API cost, emphasizing the need for smaller, cheaper alternatives. Nevertheless, we find that current small models (4B–8B) struggle to detect hidden objectives despite access to the CoT, frequently misattributing them as part of the user query. To address this, we propose a post-training pipeline combining supervised fine-tuning (SFT) and reinforcement learning (RL), where SFT narrows the gap for in-domain tasks by distilling detection behavior from stronger monitors, and RL on hard and subtly crafted hidden objectives helps the model generalize to out-of-domain…