Detecting Safety Violations Across Many Agent Traces
Adam Stein, Davis Brown, Hamed Hassani, Mayur Naik, Eric Wong

TL;DR
Meerkat combines clustering with agentic search to detect safety violations across large collections of agent traces, outperforming existing approaches in several settings.
Contribution
The paper introduces Meerkat, an approach that improves safety violation detection through structured search and adaptive investigation, without relying on seed scenarios.
Findings
Meerkat uncovers sparse safety violations more effectively than baseline monitors.
It discovers widespread developer cheating on a top agent benchmark.
It finds nearly 4x more reward hacking examples on CyBench than previous methods.
Abstract
To identify safety violations, auditors often search over large sets of agent traces. This search is difficult because failures are often rare, complex, and sometimes even adversarially hidden and only detectable when multiple traces are analyzed together. These challenges arise in diverse settings such as misuse campaigns, covert sabotage, reward hacking, and prompt injection. Existing approaches struggle here for several reasons. Per-trace judges miss failures that only become visible across traces, naive agentic auditing does not scale to large trace collections, and fixed monitors are brittle to unanticipated behaviors. We introduce Meerkat, which combines clustering with agentic search to uncover violations specified in natural language. Through structured search and adaptive investigation of promising regions, Meerkat finds sparse failures without relying on seed scenarios, fixed…
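The abstract's point that some failures are only detectable when multiple traces are analyzed together can be illustrated with a toy sketch. This is not Meerkat's implementation: the traces, the Jaccard-based greedy clustering, the `per_trace_judge` and `cluster_judge` functions, and the size-based ranking with a fixed budget are all hypothetical stand-ins chosen for illustration. Here a violation is split across two traces, so a judge that reads one trace at a time misses it, while grouping similar traces and inspecting each group as a whole surfaces it.

```python
# Toy illustration only; every name and heuristic below is an assumption,
# not the method described in the paper.

def tokens(trace: str) -> frozenset:
    """Bag-of-words representation of a trace (hypothetical featurization)."""
    return frozenset(trace.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def greedy_cluster(traces, threshold=0.5):
    """Put each trace in the first cluster whose first member is similar enough."""
    clusters = []  # each cluster is a list of trace indices
    for i, trace in enumerate(traces):
        for cluster in clusters:
            if jaccard(tokens(trace), tokens(traces[cluster[0]])) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def per_trace_judge(trace: str) -> bool:
    """Flags a trace only if it contains the whole violation by itself."""
    return "read secret" in trace and "upload secret" in trace


def cluster_judge(traces, cluster) -> bool:
    """Stand-in for a deeper investigation that reads a group of traces together."""
    joined = " ".join(traces[i] for i in cluster)
    return "read secret" in joined and "upload secret" in joined


traces = [
    "user acme task deploy step read secret key",
    "user acme task deploy step upload secret key",
    "user bob task summarize step open report file",
    "user bob task summarize step write summary file",
    "user carol task translate step fetch page text",
]

clusters = greedy_cluster(traces)

# Spend a limited investigation budget on the largest clusters first
# (an assumed prioritization heuristic).
budget = 2
ranked = sorted(clusters, key=len, reverse=True)[:budget]
flagged = [c for c in ranked if cluster_judge(traces, c)]

print("per-trace flags:", [per_trace_judge(t) for t in traces])
print("clusters:", clusters)
print("flagged clusters:", flagged)
```

In this toy data no single trace trips the per-trace judge, yet the cluster containing the first two traces is flagged once they are read together. A real system would replace the keyword checks with a model-based investigator and the token-set similarity with a learned representation.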
