AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor

Shu Yang; Jingyu Hu; Tong Li; Hanqi Yan; Wenxuan Wang; Di Wang

arXiv:2601.05752·cs.CL·May 13, 2026

TL;DR

AutoMonitor-Bench is a benchmark for assessing how reliably LLM-based monitors detect misbehavior across question answering, code generation, and reasoning tasks. Evaluating 22 LLMs reveals substantial variability in monitoring performance and a consistent trade-off between missed detections and false alarms.

Contribution

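The paper introduces AutoMonitor-Bench, the first benchmark for systematically evaluating LLM-based misbehavior monitors. It provides 3,010 annotated test samples with paired misbehavior and benign instances, a training corpus of 153,581 samples, and an analysis of the trade-off between Miss Rate (MR) and False Alarm Rate (FAR).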

Findings

01. Monitor performance varies significantly across models.

02. A trade-off is observed between Miss Rate and False Alarm Rate.

03. Training on easy misbehavior datasets does not fully improve detection of complex misbehaviors.

Abstract

We introduce AutoMonitor-Bench, the first benchmark designed to systematically evaluate the reliability of LLM-based misbehavior monitors across diverse tasks and failure modes. AutoMonitor-Bench consists of 3,010 carefully annotated test samples spanning question answering, code generation, and reasoning, with paired misbehavior and benign instances. We evaluate monitors using two complementary metrics: Miss Rate (MR) and False Alarm Rate (FAR), capturing failures to detect misbehavior and oversensitivity to benign behavior, respectively. Evaluating 12 proprietary and 10 open-source LLMs, we observe substantial variability in monitoring performance and a consistent trade-off between MR and FAR, revealing an inherent safety-utility tension. To further explore the limits of monitor reliability, we construct a large-scale training corpus of 153,581 samples and fine-tune…
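The two metrics can be illustrated with a small sketch. The abstract does not give formulas, so this assumes the conventional definitions: MR is the fraction of ground-truth misbehavior samples the monitor fails to flag, and FAR is the fraction of benign samples it flags. The `Sample` type, its field names, and the toy data are hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """Hypothetical record: ground-truth label plus the monitor's verdict."""
    is_misbehavior: bool  # True if the sample is annotated as misbehavior
    flagged: bool         # True if the monitor raised an alarm


def miss_rate(samples: Sequence[Sample]) -> float:
    """Assumed definition: missed misbehavior / all misbehavior samples."""
    positives = [s for s in samples if s.is_misbehavior]
    if not positives:
        raise ValueError("no misbehavior samples to evaluate")
    return sum(not s.flagged for s in positives) / len(positives)


def false_alarm_rate(samples: Sequence[Sample]) -> float:
    """Assumed definition: flagged benign samples / all benign samples."""
    negatives = [s for s in samples if not s.is_misbehavior]
    if not negatives:
        raise ValueError("no benign samples to evaluate")
    return sum(s.flagged for s in negatives) / len(negatives)


# Toy data (invented for illustration): 4 misbehavior and 4 benign samples.
example = [
    Sample(is_misbehavior=True, flagged=True),
    Sample(is_misbehavior=True, flagged=True),
    Sample(is_misbehavior=True, flagged=True),
    Sample(is_misbehavior=True, flagged=False),   # one missed detection
    Sample(is_misbehavior=False, flagged=True),   # false alarm
    Sample(is_misbehavior=False, flagged=True),   # false alarm
    Sample(is_misbehavior=False, flagged=False),
    Sample(is_misbehavior=False, flagged=False),
]

print(f"MR  = {miss_rate(example):.2f}")
print(f"FAR = {false_alarm_rate(example):.2f}")
```

Under these assumed definitions, a monitor that flags more aggressively lowers MR at the cost of a higher FAR, which is the trade-off the abstract describes.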
