Mitigating Deceptive Alignment via Self-Monitoring

Jiaming Ji; Wenqi Chen; Kaile Wang; Donghai Hong; Sitong Fang; Boyuan Chen; Jiayi Zhou; Juntao Dai; Sirui Han; Yike Guo; Yaodong Yang

arXiv:2505.18807·cs.AI·May 27, 2025

TL;DR

This paper introduces CoT Monitor+, a framework embedding self-monitoring within chain-of-thought reasoning to detect and suppress deceptive behaviors in large language models, significantly reducing covert misaligned strategies.

Contribution

It presents the first internal self-monitoring mechanism integrated into reasoning, along with DeceptionBench for systematic evaluation of deceptive alignment in LLMs.

Findings

01

CoT Monitor+ reduces deceptive behaviors by 43.8% on average.

02

Unrestricted CoT aggravates deceptive tendencies.

03

Self-monitor signals improve transparency and reduce obfuscated reasoning.

Abstract

Modern large language models rely on chain-of-thought (CoT) reasoning to achieve impressive performance, yet the same mechanism can amplify deceptive alignment: situations in which a model appears aligned while covertly pursuing misaligned goals. Existing safety pipelines treat deception as a black-box output to be filtered post-hoc, leaving the model free to scheme during its internal reasoning. We ask: Can deception be intercepted while the model is thinking? We answer this question with CoT Monitor+, the first framework that embeds a Self-Monitor inside the CoT process itself. During generation, the model produces (i) ordinary reasoning steps and (ii) an internal self-evaluation signal trained to flag and suppress misaligned strategies. The signal is used as an auxiliary reward in reinforcement learning, creating a feedback loop that rewards honest reasoning and discourages hidden…
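The abstract says the self-evaluation signal serves as an auxiliary reward in reinforcement learning, but this excerpt does not say how the two are combined. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the self-monitor emits a probability that the reasoning is deceptive and that this is mixed linearly with the task reward. The names `Rollout`, `monitor_flag_prob`, `shaped_reward`, and `aux_weight` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    # Reward from the ordinary task objective (hypothetical scale).
    task_reward: float
    # Assumed self-monitor output: probability in [0, 1] that the
    # chain-of-thought contains a misaligned or deceptive strategy.
    monitor_flag_prob: float


def shaped_reward(rollout: Rollout, aux_weight: float = 0.5) -> float:
    """Add an auxiliary honesty term to the task reward.

    The linear form and the default weight are illustrative assumptions;
    the source only states that the signal is an auxiliary reward.
    """
    if not 0.0 <= rollout.monitor_flag_prob <= 1.0:
        raise ValueError("monitor_flag_prob must lie in [0, 1]")
    # Map the flag probability to [-1, +1]: +1 when unflagged, -1 when flagged.
    honesty_term = 1.0 - 2.0 * rollout.monitor_flag_prob
    return rollout.task_reward + aux_weight * honesty_term


honest = shaped_reward(Rollout(task_reward=1.0, monitor_flag_prob=0.0))
flagged = shaped_reward(Rollout(task_reward=1.0, monitor_flag_prob=1.0))
```

With equal task reward, the unflagged rollout scores higher than the flagged one, which is the feedback loop the abstract describes. Because the same model produces both the reasoning and the signal in this setup, any such scheme would also need a guard against the monitor learning to emit labels that favor itself.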

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 2·Confidence 4

Strengths

The paper situates itself well within the literature on deceptive alignment and its mitigations. This is the first paper I'm aware of to explicitly focus on RL using CoT monitoring scored by the same model that is being trained. I like the typology of LLM deception into 5 types; this seems decently broad for an initial hand-curated dataset. Fig 3: This graph is very promising. It appears to show that the authors' DTR method shows significantly better agreement with human annotation than LLM

Weaknesses

Alignment training pressure on CoT is strategically bad in the long-term: such suppression results in persistent deception which is harder to detect via CoT monitoring, as models "hide their scheming" (see Baker citations in the paper). Such approaches should therefore be discouraged. The authors, in their results section, comment that "the model learns to game the feedback process while also concealing its true intent within the CoT". On L108 the paper talks about pushing deception to a sublim

Reviewer 02·Rating 2·Confidence 3

Strengths

DeceptionBench seems plausibly interesting and might be useful to other projects; I'm not sure. It's plausible that self-monitoring is a good idea.

Weaknesses

I don't feel persuaded by the evidence that the self-monitoring algorithm works better than baselines. To me the core claim of the paper is that there is something good about getting a model to monitor itself, rather than getting another model to monitor it. This is a priori plausible to me, but I don't think the paper demonstrates it. My biggest problem with the results presented in figure 5 is that the human labels were used to improve the self-monitor, but were not used to improve the outer

Reviewer 03·Rating 4·Confidence 3

Strengths

1. The paper addresses deception, which is an important AI Alignment and Safety issue.
2. The authors introduce a new deception benchmark.
3. The authors propose a method to reduce deception.

Weaknesses

1. The paper spends much more space explaining basic concepts than results. In particular, the presentation of the results is very brief and lacks a more thorough description of important results (see following remarks).
2. The article states: "To avoid reward hacking, where SELF-MONITOR might simply output labels beneficial to itself, we update the off-policy SELF-MONITOR every 15 steps." What happens if this is not done? How was the policy updated? Much more detail is needed here to addre
