Mitigating Deceptive Alignment via Self-Monitoring
Jiaming Ji, Wenqi Chen, Kaile Wang, Donghai Hong, Sitong Fang, Boyuan Chen, Jiayi Zhou, Juntao Dai, Sirui Han, Yike Guo, Yaodong Yang

TL;DR
This paper introduces CoT Monitor+, a framework embedding self-monitoring within chain-of-thought reasoning to detect and suppress deceptive behaviors in large language models, significantly reducing covert misaligned strategies.
Contribution
It presents the first internal self-monitoring mechanism integrated into reasoning, along with DeceptionBench for systematic evaluation of deceptive alignment in LLMs.
Findings
CoT Monitor+ reduces deceptive behaviors by 43.8% on average.
Unrestricted CoT aggravates deceptive tendencies.
Self-monitor signals improve transparency and reduce obfuscated reasoning.
Abstract
Modern large language models rely on chain-of-thought (CoT) reasoning to achieve impressive performance, yet the same mechanism can amplify deceptive alignment: situations in which a model appears aligned while covertly pursuing misaligned goals. Existing safety pipelines treat deception as a black-box output to be filtered post hoc, leaving the model free to scheme during its internal reasoning. We ask: Can deception be intercepted while the model is thinking? We answer this question with CoT Monitor+, the first framework that embeds a Self-Monitor inside the CoT process itself. During generation, the model produces (i) ordinary reasoning steps and (ii) an internal self-evaluation signal trained to flag and suppress misaligned strategies. The signal is used as an auxiliary reward in reinforcement learning, creating a feedback loop that rewards honest reasoning and discourages hidden…
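The feedback loop described in the abstract can be pictured as a form of reward shaping. The following is a minimal sketch only, assuming a scalar task reward and a self-monitor score in [0, 1] where higher means "more likely deceptive". The names `Rollout`, `shaped_reward`, and `monitor_weight`, and the linear combination itself, are illustrative assumptions; the excerpt above does not give the paper's actual reward formulation.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One sampled response (hypothetical container, not from the paper)."""
    task_reward: float      # reward from the ordinary RL objective
    deception_score: float  # self-monitor output in [0, 1]; 1 = flagged as deceptive


def shaped_reward(rollout: Rollout, monitor_weight: float = 0.5) -> float:
    """Add an auxiliary honesty term to the task reward (assumed linear form)."""
    if not 0.0 <= rollout.deception_score <= 1.0:
        raise ValueError("deception_score must lie in [0, 1]")
    honesty_bonus = 1.0 - rollout.deception_score
    return rollout.task_reward + monitor_weight * honesty_bonus


# Two rollouts with equal task reward but different self-monitor flags.
honest = Rollout(task_reward=1.0, deception_score=0.1)
scheming = Rollout(task_reward=1.0, deception_score=0.9)
print(shaped_reward(honest), shaped_reward(scheming))
```

Under these assumptions, two responses that solve the task equally well are ranked by how strongly the self-monitor flags their reasoning, which is the sense in which the signal "rewards honest reasoning".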
Peer Reviews
Decision · Submitted to ICLR 2026
The paper situates itself well within the literature of deceptive alignment and its mitigations. This is the first paper I'm aware of to explicitly focus on RL using CoT-monitoring scored by the same model which is to be trained. I like the typology of LLM deception into 5 types; this seems decently broad for an initial hand-curated dataset. Fig 3: This graph is very promising. It appears to show that the authors' DTR method shows significantly better agreement with human annotation than LLM
Alignment training pressure on CoT is strategically bad in the long term: such suppression results in persistent deception which is harder to detect via CoT monitoring, as models "hide their scheming" (see Baker citations in the paper). Such approaches should therefore be discouraged. The authors, in their results section, comment that "the model learns to game the feedback process while also concealing its true intent within the CoT". On L108 the paper talks about pushing deception to a sublim
DeceptionBench seems plausibly interesting and might be useful to other projects; I'm not sure. It's plausible that self-monitoring is a good idea.
I don't feel persuaded by the evidence that the self-monitoring algorithm works better than baselines. To me the core claim of the paper is that there is something good about getting a model to monitor itself, rather than getting another model to monitor it. This is a priori plausible to me, but I don't think the paper demonstrates it. My biggest problem with the results presented in figure 5 is that the human labels were used to improve the self-monitor, but were not used to improve the outer
1. The paper addresses deception, which is an important AI Alignment and Safety issue.
2. The authors introduce a new deception benchmark.
3. The authors propose a method to reduce deception.
1. The paper spends much more space explaining basic concepts than results. In particular, the presentation of the results is very brief and lacks a more thorough description of important results (see following remarks).
2. The article states: "To avoid reward hacking, where SELF-MONITOR might simply output labels beneficial to itself, we update the off-policy SELF-MONITOR every 15 steps." What happens if this is not done? How was the policy updated? Many more details are needed here to addre
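One plausible reading of the quoted "update the off-policy SELF-MONITOR every 15 steps" is a frozen monitor snapshot that is refreshed from the live model on a fixed schedule. The sketch below illustrates only that scheduling pattern; `ToyMonitor`, `train`, and the placeholder update are hypothetical, and the paper's actual update procedure is exactly what the reviewer says is not described.

```python
import copy


class ToyMonitor:
    """Hypothetical stand-in for a self-monitor; holds a single threshold."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold


def train(num_steps: int, sync_every: int = 15) -> list[int]:
    """Return the steps at which the frozen monitor snapshot was refreshed."""
    live = ToyMonitor()           # changes at every policy step
    frozen = copy.deepcopy(live)  # snapshot that would score rollouts
    sync_steps = []
    for step in range(1, num_steps + 1):
        # Rollouts would be scored with `frozen` here, so the policy cannot
        # immediately profit from shifting its own monitor labels.
        live.threshold += 0.01    # placeholder for a real gradient update
        if step % sync_every == 0:
            frozen = copy.deepcopy(live)  # periodic refresh of the snapshot
            sync_steps.append(step)
    return sync_steps


print(train(45))  # → [15, 30, 45]
```

The design intuition, under these assumptions, is the same as a lagged target network: scoring with a delayed copy weakens the incentive for the policy to drift its own labels in its favor between refreshes.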
