MonitoringBench: Semi-Automated Red-Teaming for Agent Monitoring
Monika Jotautaitė, Maria Angelica Martinez, Ollie Matthews, Tyler Tracy

TL;DR
MonitoringBench is a semi-automated red-teaming framework that generates diverse, strong attack trajectories to evaluate and improve coding-agent monitors, revealing current limitations and guiding future enhancements.
Contribution
The paper introduces a novel semi-automated red-teaming pipeline and a comprehensive attack benchmark for assessing coding-agent monitor robustness.
Findings
Attacks reduce monitor catch rate from 94.9% to 60.3%.
Attacks generalize across multiple monitors, revealing common vulnerabilities.
Current frontier monitors detect suspicious actions but struggle with persuasion and calibration.
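To put the first finding in perspective, the short sketch below works through the arithmetic behind the reported catch rates (94.9% and 60.3%). The function name and the three derived quantities are illustrative choices, not metrics defined by the paper; only the two input percentages come from the findings above.

```python
def catch_rate_drop(baseline_pct, attacked_pct):
    """Summarize a change in monitor catch rate.

    Returns a tuple of:
      - absolute drop in percentage points,
      - relative drop as a percent of the baseline catch rate,
      - multiplier on the miss rate (100 - catch rate).
    These derived quantities are illustrative, not taken from the paper.
    """
    abs_drop = baseline_pct - attacked_pct
    rel_drop = 100.0 * abs_drop / baseline_pct
    miss_ratio = (100.0 - attacked_pct) / (100.0 - baseline_pct)
    return abs_drop, rel_drop, miss_ratio


# Catch rates reported in the findings above.
abs_drop, rel_drop, miss_ratio = catch_rate_drop(94.9, 60.3)
print(f"{abs_drop:.1f} pp, {rel_drop:.1f}% relative, {miss_ratio:.1f}x misses")
# prints "34.6 pp, 36.5% relative, 7.8x misses"
```

Read this way, the share of attacks that go uncaught rises from 5.1% to 39.7%, which is often the more informative number when judging a monitor.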
Abstract
We introduce a red-teaming methodology that exposes harder-to-catch attacks for coding-agent monitors, suggesting that current practices may under-elicit attacks and overstate monitor performance. We identify three challenges with current red-teaming. The first is mode collapse in attack generation, which we reduce with a novel attack taxonomy for broader coverage. The second is a conceive-execute gap: frontier LLMs can propose strong attack ideas or execute them, but not both at once. We mitigate this by decomposing attack construction into strategy generation, execution, and post-hoc trajectory refinement. The third is that manual elicitation is costly to scale, which we address with our semi-automated red-teaming pipeline. Applied to BashArena, an AI control setting for tool-using coding agents, this pipeline produces MonitoringBench, a benchmark of 2,644 attack trajectories for evaluating monitor…
