Hodoscope: Unsupervised Monitoring for AI Misbehaviors
Ziqian Zhong, Shashwat Saxena, Aditi Raghunathan

TL;DR
Hodoscope is an unsupervised monitoring tool that helps humans discover problematic AI behaviors by comparing behavior distributions across groups, reducing review effort and uncovering new vulnerabilities.
Contribution
We introduce Hodoscope, a novel unsupervised method for monitoring AI behaviors through group-wise comparisons, enabling discovery of unknown failures without predefined rules.
Findings
Hodoscope identified a new vulnerability in the Commit0 benchmark.
It recovered known exploits on ImpossibleBench and SWE-bench.
It reduces human review effort by a factor of 6 to 23.
Abstract
Existing approaches to monitoring AI agents rely on supervised evaluation: human-written rules or LLM-based judges that check for known failure modes. However, novel misbehaviors may fall outside predefined categories entirely, and LLM-based judges can be unreliable. To address this, we formulate unsupervised monitoring, drawing an analogy to unsupervised learning. Rather than checking for specific misbehaviors, an unsupervised monitor assists humans in discovering problematic agent behaviors without prior assumptions about what counts as problematic, leaving that determination to the human. We observe that problematic behaviors are often distinctive: a model exploiting a benchmark loophole exhibits actions absent from well-behaved baselines, and a vulnerability unique to one evaluation manifests as behavioral anomalies when the same model runs across multiple benchmarks. This…
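The group-wise comparison idea in the abstract can be illustrated with a toy sketch. This is not Hodoscope's implementation, which the text above does not specify. It assumes each agent action has already been reduced to a short label, and it ranks labels by a smoothed log-ratio of their frequency in a target group versus a baseline group. The function name `distinctive_actions`, the action labels, and the smoothing scheme are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def distinctive_actions(target, baseline, smoothing=1.0, top_k=3):
    """Rank action labels by how over-represented they are in `target`
    relative to `baseline` (hypothetical helper, not the paper's method)."""
    t_counts, b_counts = Counter(target), Counter(baseline)
    vocab = set(t_counts) | set(b_counts)
    # Additive smoothing so labels absent from one group get a finite score.
    t_total = len(target) + smoothing * len(vocab)
    b_total = len(baseline) + smoothing * len(vocab)
    scores = {}
    for action in vocab:
        p_target = (t_counts[action] + smoothing) / t_total
        p_baseline = (b_counts[action] + smoothing) / b_total
        scores[action] = math.log(p_target / p_baseline)
    # Highest scores first: these are the labels a human would review first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Made-up action labels: the target group contains one behavior
# that never appears in the baseline group.
baseline = ["read_file", "edit_file", "run_tests"] * 10
target = ["read_file", "edit_file", "run_tests"] * 9 + ["edit_test_file"] * 3

ranked = distinctive_actions(target, baseline)
print(ranked[0][0])
```

In this made-up data the label that appears only in the target group ranks first. The sketch surfaces candidates for review and makes no judgment about whether a behavior is problematic, which matches the abstract's framing of leaving that determination to the human.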
