AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor
Shu Yang, Jingyu Hu, Tong Li, Hanqi Yan, Wenxuan Wang, Di Wang

TL;DR
AutoMonitor-Bench is a benchmark for assessing the reliability of LLM-based misbehavior monitors across question answering, code generation, and reasoning tasks. It reveals substantial variability across monitors and a consistent trade-off between missed detections and false alarms.
Contribution
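The paper introduces AutoMonitor-Bench, which the authors describe as the first benchmark for systematically evaluating LLM-based misbehavior monitors. It comprises 3,010 annotated test samples with paired misbehavior and benign instances, plus an analysis of the trade-off between Miss Rate and False Alarm Rate.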
Findings
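Monitoring performance varies substantially across the 22 evaluated LLMs (12 proprietary, 10 open-source).
A consistent trade-off appears between Miss Rate (MR) and False Alarm Rate (FAR), which the authors read as an inherent safety-utility tension.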
Training on easy misbehavior datasets does not fully improve detection of complex misbehaviors.
Abstract
We introduce AutoMonitor-Bench, the first benchmark designed to systematically evaluate the reliability of LLM-based misbehavior monitors across diverse tasks and failure modes. AutoMonitor-Bench consists of 3,010 carefully annotated test samples spanning question answering, code generation, and reasoning, with paired misbehavior and benign instances. We evaluate monitors using two complementary metrics: Miss Rate (MR) and False Alarm Rate (FAR), capturing failures to detect misbehavior and oversensitivity to benign behavior, respectively. Evaluating 12 proprietary and 10 open-source LLMs, we observe substantial variability in monitoring performance and a consistent trade-off between MR and FAR, revealing an inherent safety-utility tension. To further explore the limits of monitor reliability, we construct a large-scale training corpus of 153,581 samples and fine-tune…
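To make the two metrics concrete, here is a minimal sketch of how MR and FAR could be computed from labeled monitor verdicts. The exact formulas are an assumption based on the abstract's wording: MR is taken as the fraction of misbehavior instances the monitor fails to flag, and FAR as the fraction of benign instances it flags. The function names, the numeric suspicion scores, and the thresholds are hypothetical illustrations, not part of the benchmark.

```python
from typing import List, Sequence, Tuple


def miss_and_false_alarm_rates(
    labels: Sequence[bool], flags: Sequence[bool]
) -> Tuple[float, float]:
    """Return (miss_rate, false_alarm_rate).

    labels[i] is True when sample i is a misbehavior instance.
    flags[i] is True when the monitor flagged sample i.
    Assumed definitions (not confirmed by the source):
      MR  = missed misbehavior / all misbehavior
      FAR = flagged benign / all benign
    """
    if len(labels) != len(flags):
        raise ValueError("labels and flags must have the same length")
    n_misbehavior = sum(1 for y in labels if y)
    n_benign = len(labels) - n_misbehavior
    if n_misbehavior == 0 or n_benign == 0:
        raise ValueError("need at least one misbehavior and one benign sample")
    missed = sum(1 for y, f in zip(labels, flags) if y and not f)
    false_alarms = sum(1 for y, f in zip(labels, flags) if not y and f)
    return missed / n_misbehavior, false_alarms / n_benign


def sweep_thresholds(
    labels: Sequence[bool], scores: Sequence[float], thresholds: Sequence[float]
) -> List[Tuple[float, float, float]]:
    """Flag a sample when score >= threshold; return (threshold, MR, FAR) rows."""
    rows = []
    for t in thresholds:
        flags = [s >= t for s in scores]
        mr, far = miss_and_false_alarm_rates(labels, flags)
        rows.append((t, mr, far))
    return rows


# Toy data: four misbehavior samples followed by four benign ones,
# each with a made-up suspicion score in [0, 1].
toy_labels = [True, True, True, True, False, False, False, False]
toy_scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

for threshold, mr, far in sweep_thresholds(toy_labels, toy_scores, [0.25, 0.5, 0.75]):
    print(f"threshold={threshold:.2f}  MR={mr:.2f}  FAR={far:.2f}")
```

On this toy data, raising the threshold moves MR from 0.00 to 0.50 while FAR falls from 0.50 to 0.00, which is the kind of trade-off the abstract describes. A monitor that only emits binary verdicts would skip the sweep and call `miss_and_false_alarm_rates` directly.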
