MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models

Han Wang; Yifan Sun; Brian Ko; Mann Talati; Jiawen Gong; Zimeng Li; Naicheng Yu; Xucheng Yu; Wei Shen; Vedant Jolly; Huan Zhang

arXiv:2603.28590 · cs.AI · April 3, 2026

TL;DR

MonitorBench is an open-source benchmark for evaluating how reliably an LLM's chain of thought can be used to monitor the decision-critical factors driving its behavior, including under stress-test settings.

Contribution

It introduces 1,514 test instances across 19 tasks in 7 categories, plus two stress-test settings, to systematically assess CoT monitorability in LLMs.

Findings

01 Monitorability is higher when decision-critical factors influence intermediate reasoning.

02 More capable LLMs tend to have lower monitorability.

03 Stress-tests can reduce monitorability by up to 30% in some tasks.

Abstract

Large language models (LLMs) can generate chains of thought (CoTs) that are not always causally responsible for their final outputs. When such a mismatch occurs, the CoT no longer faithfully reflects the actual reasons (i.e., decision-critical factors) driving the model's behavior, leading to the reduced CoT monitorability problem. However, a comprehensive and fully open-source benchmark for thoroughly evaluating CoT monitorability remains lacking. To address this gap, we propose MonitorBench, a systematic benchmark for evaluating CoT monitorability in LLMs. MonitorBench provides: (1) a diverse set of 1,514 test instances with carefully designed decision-critical factors across 19 tasks spanning 7 categories to characterize when CoTs can be used to monitor the factors driving LLM behavior; and (2) two stress-test settings to quantify the extent to which CoT…

Code & Models

Repositories

ASTRAL-Group/MonitorBench (GitHub)
