MonitoringBench: Semi-Automated Red-Teaming for Agent Monitoring

Monika Jotautaitė; Maria Angelica Martinez; Ollie Matthews; Tyler Tracy

arXiv:2605.09684 · cs.CR · May 12, 2026

TL;DR

MonitoringBench is a semi-automated red-teaming framework that generates diverse, strong attack trajectories for evaluating coding-agent monitors. Its results suggest that current red-teaming practices under-elicit attacks and overstate monitor performance.

Contribution

The paper introduces a semi-automated red-teaming pipeline and MonitoringBench, a benchmark of 2,644 attack trajectories for assessing the robustness of coding-agent monitors.

Findings

01. Attacks reduce monitor catch rate from 94.9% to 60.3%.

02. Attacks generalize across multiple monitors, revealing common vulnerabilities.

03. Current frontier monitors detect suspicious actions but struggle with persuasion and calibration.

Abstract

We introduce a red-teaming methodology that exposes harder-to-catch attacks for coding-agent monitors, suggesting that current practices may under-elicit attacks and overstate monitor performance. We identify three challenges with current red-teaming. First, attack generation suffers from mode collapse, which we reduce with a novel attack taxonomy for broader coverage. Second, there is a conceive-execute gap: frontier LLMs can propose strong attack ideas or execute them, but not both at once. We mitigate this by decomposing attack construction into strategy generation, execution, and post-hoc trajectory refinement. Third, manual elicitation is costly to scale, which we address with our semi-automated red-teaming pipeline. Applied to BashArena, an AI control setting for tool-using coding agents, this pipeline produces MonitoringBench, a benchmark of 2,644 attack trajectories for evaluating monitor…
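To make the "catch rate" figures concrete, here is a minimal sketch of how such a metric could be computed. It is an illustration under stated assumptions, not the paper's evaluation code: it assumes a monitor assigns each trajectory a numeric suspicion score, that the flagging threshold is calibrated on honest (non-attack) trajectories to stay within a target false-positive rate, and that a trajectory counts as caught when its score exceeds that threshold. The function names `threshold_at_fpr` and `catch_rate` and all scores below are hypothetical.

```python
from typing import Sequence


def threshold_at_fpr(honest_scores: Sequence[float], max_fpr: float) -> float:
    """Return the smallest observed honest score t such that the fraction of
    honest scores strictly above t is at most max_fpr (assumed calibration rule)."""
    if not honest_scores:
        raise ValueError("honest_scores must be non-empty")
    ranked = sorted(honest_scores)
    for t in ranked:
        fpr = sum(s > t for s in honest_scores) / len(honest_scores)
        if fpr <= max_fpr:
            return t
    return ranked[-1]  # unreachable for max_fpr >= 0, kept as a safe fallback


def catch_rate(attack_scores: Sequence[float], threshold: float) -> float:
    """Fraction of attack trajectories whose suspicion score exceeds the threshold."""
    if not attack_scores:
        raise ValueError("attack_scores must be non-empty")
    return sum(s > threshold for s in attack_scores) / len(attack_scores)


# Toy, made-up suspicion scores (not data from the paper).
honest = [1, 2, 2, 3, 9]
attacks = [2, 4, 8, 3, 10]

t = threshold_at_fpr(honest, max_fpr=0.2)  # 3: only the score 9 lies above it
print(t, catch_rate(attacks, t))           # 3 0.6
```

Under this reading, a drop in catch rate means that a larger share of attack trajectories score at or below the calibrated threshold, so stronger attacks lower the metric without any change to the monitor itself.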
