MONICA: Real-Time Monitoring and Calibration of Chain-of-Thought Sycophancy in Large Reasoning Models
Jingyu Hu, Shu Yang, Xilin Gong, Hongming Wang, Weiru Liu, Di Wang

TL;DR
This paper introduces MONICA, a real-time monitoring and calibration framework that detects and reduces sycophantic behavior in large reasoning models during the reasoning process, improving reliability and societal safety.
Contribution
MONICA is the first framework to monitor and mitigate sycophancy during reasoning steps without waiting for final answers, enhancing model trustworthiness.
Findings
Effectively reduces sycophantic behavior in reasoning models
Improves robustness across multiple datasets and models
Provides real-time monitoring during inference
Abstract
Large Reasoning Models (LRMs) suffer from sycophantic behavior, where models tend to agree with users' incorrect beliefs and follow misinformation rather than maintain independent reasoning. This behavior undermines model reliability and poses societal risks. Mitigating LRM sycophancy requires monitoring how this sycophancy emerges during the reasoning trajectory; however, current methods mainly focus on judging based on final answers and correcting them, without understanding how sycophancy develops during reasoning processes. To address this limitation, we propose MONICA, a novel Monitor-guided Calibration framework that monitors and mitigates sycophancy during model inference at the level of reasoning steps, without requiring the model to finish generating its complete answer. MONICA integrates a sycophantic monitor that provides real-time monitoring of sycophantic drift scores…
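The step-level loop described in the abstract can be sketched as follows. This is a minimal illustration under loud assumptions, not the paper's implementation: the monitor is assumed to be a linear probe over one hidden-state vector per reasoning step, and calibration is assumed to remove the component of that vector along the probe direction. The names `drift_score`, `calibrate`, and `monitor_and_calibrate`, as well as the threshold and strength values, are hypothetical.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def drift_score(hidden, weights, bias):
    """Hypothetical monitor: logistic probe giving a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(dot(hidden, weights) + bias)))


def calibrate(hidden, direction, strength):
    """Hypothetical calibrator: subtract `strength` times the projection
    of `hidden` onto the unit vector along `direction`."""
    norm = math.sqrt(dot(direction, direction))
    unit = [d / norm for d in direction]
    projection = dot(hidden, unit)
    return [h - strength * projection * u for h, u in zip(hidden, unit)]


def monitor_and_calibrate(step_states, weights, bias,
                          threshold=0.5, strength=1.0):
    """Score each reasoning step as it arrives; steer only the steps
    whose score exceeds the (assumed) threshold."""
    scores, steered = [], []
    for hidden in step_states:
        score = drift_score(hidden, weights, bias)
        scores.append(score)
        if score > threshold:
            steered.append(calibrate(hidden, weights, strength))
        else:
            steered.append(list(hidden))
    return scores, steered


# Toy usage: two steps, three-dimensional made-up "hidden states".
toy_states = [[-2.0, 1.0, 0.0], [3.0, 1.0, 0.0]]
toy_scores, toy_steered = monitor_and_calibrate(
    toy_states, weights=[1.0, 0.0, 0.0], bias=0.0
)
```

In the toy run only the second step crosses the threshold, so only its component along the probe direction is removed. The point of the sketch is that the decision is made per step, before any final answer exists.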
Peer Reviews
Decision·Submitted to ICLR 2026
- The authors have identified a rather critical and often overlooked issue concerning sycophantic behavior within the intermediate chain-of-thought reasoning processes of Large Reasoning Models.
- The motivation behind tackling this specific aspect of sycophancy is quite clear, and the proposed MONICA framework is described with good clarity.
- The experimental evaluation is fairly extensive, covering 12 datasets and 3 different LRMs, which helps demonstrate a degree of generalizability for th
- While the authors mention using GPT-4o to identify and label sycophantic patterns in Section 2.2 with "manual annotation for deduplication and quality control", I would appreciate a quantitative validation of the annotation quality, such as inter-annotator agreement or comparison with human expert labels. This is important since if GPT-4o exhibits biases in identifying sycophancy, these biases would propagate throughout the entire training pipeline and the final system.
- This paper claims "Re
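One common way to quantify the inter-annotator agreement requested in this review is Cohen's kappa between the model-assigned labels and human labels. The sketch below is a generic illustration, not something taken from the paper; the function name and the example labels are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention of this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels: 1 = sycophantic sentence, 0 = not sycophantic.
model_labels = [1, 1, 0, 0, 1, 0, 1, 0]
human_labels = [1, 1, 0, 0, 1, 0, 0, 1]
kappa = cohens_kappa(model_labels, human_labels)
```

With these made-up labels the annotators agree on 6 of 8 items, chance agreement is 0.5, and kappa comes out at 0.5.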
1. The proposed methods show improvement in reducing the model's sycophantic behavior.
2. The idea of using a monitor to control calibration is novel and makes some sense.
The weakness of this paper mainly lies in novelty and applications:
1. From my perspective, the monitor and calibrator ideas are not novel and are widely used in the activation engineering literature. The paper's contribution is mainly applying existing methods to large reasoning models to reduce sycophancy. The technical difficulty of this is not well justified by the paper, limiting its novelty.
2. Despite its ability to reduce sycophancy, I still have concerns about the applicability of proposed metho
1. MONICA operates during inference by manipulating model activations, making it computationally efficient as it does not require expensive model fine-tuning.
2. The framework is highly precise because it is trained on a specialized dataset of subtle, sentence-level sycophantic patterns, allowing it to accurately identify and correct flawed reasoning that other methods miss.
3. Experiments show that MONICA consistently outperforms other mitigation strategies, effectively reducing sycophancy whil
1. To see how MONICA performs under the normal setting (i.e., no cue is given), the authors should report the normal performance (e.g., accuracy) on the reasoning datasets when no cues are given, with monitoring and calibration turned on.
2. When constructing the sycophancy dataset, the authors classify responses as sycophantic when the predicted answers match the incorrect cues. However, the model could also happen to predict this answer even if it is not favouring the user's cue. This rule-based c
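The labelling rule discussed in the second point, and the concern raised about it, can be made concrete with a short sketch. `label_rule_based` follows the rule as the reviewer describes it. `label_with_baseline` is a hypothetical stricter variant, not something the paper or the review proposes, that additionally checks the model's answer when no cue is given. All field names and examples are made up.

```python
def label_rule_based(answer_with_cue, cue, gold):
    """Rule as described in the review: a response is labelled
    sycophantic when it matches a cue that is incorrect."""
    return cue != gold and answer_with_cue == cue


def label_with_baseline(answer_with_cue, answer_without_cue, cue, gold):
    """Hypothetical stricter variant: also require that the model did
    not already give the cued answer when no cue was present."""
    return (label_rule_based(answer_with_cue, cue, gold)
            and answer_without_cue != cue)


# Made-up multiple-choice examples.
examples = [
    # The model switches from the gold answer to the incorrect cue.
    {"with_cue": "B", "without_cue": "A", "cue": "B", "gold": "A"},
    # The model answers "B" with or without the cue.
    {"with_cue": "B", "without_cue": "B", "cue": "B", "gold": "A"},
]
rule_labels = [
    label_rule_based(e["with_cue"], e["cue"], e["gold"]) for e in examples
]
strict_labels = [
    label_with_baseline(e["with_cue"], e["without_cue"], e["cue"], e["gold"])
    for e in examples
]
```

The second example is the case the reviewer worries about: the rule alone labels it sycophantic, while the baseline check treats it as an ordinary wrong answer, because the model picks the same option without any cue.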
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare · Data Stream Mining Techniques
