TL;DR
MonitorBench is an open-source benchmark for evaluating how reliably the chains of thought (CoTs) of large language models can be used to monitor the factors driving their behavior, including under stress-test settings.
Contribution
It introduces 1,514 test instances across 19 tasks in 7 categories, together with two stress-test settings, to systematically assess CoT monitorability in LLMs.
Findings
Monitorability is higher when decision-critical factors influence intermediate reasoning.
More capable LLMs tend to have lower monitorability.
Stress-tests can reduce monitorability by up to 30% in some tasks.
Abstract
Large language models (LLMs) can generate chains of thought (CoTs) that are not always causally responsible for their final outputs. When such a mismatch occurs, the CoT no longer faithfully reflects the actual reasons (i.e., decision-critical factors) driving the model's behavior, leading to the reduced CoT monitorability problem. However, a comprehensive and fully open-source benchmark for thoroughly evaluating CoT monitorability remains lacking. To address this gap, we propose MonitorBench, a systematic benchmark for evaluating CoT monitorability in LLMs. MonitorBench provides: (1) a diverse set of 1,514 test instances with carefully designed decision-critical factors across 19 tasks spanning 7 categories to characterize when CoTs can be used to monitor the factors driving LLM behavior; and (2) two stress-test settings to quantify the extent to which CoT…
