Hodoscope: Unsupervised Monitoring for AI Misbehaviors

Ziqian Zhong; Shashwat Saxena; Aditi Raghunathan

arXiv:2604.11072 · cs.AI · April 14, 2026

TL;DR

Hodoscope is an unsupervised monitoring tool that helps humans discover problematic AI behaviors by comparing behavior distributions across groups, reducing review effort and uncovering new vulnerabilities.

Contribution

We introduce Hodoscope, a novel unsupervised method for monitoring AI behaviors through group-wise comparisons, enabling discovery of unknown failures without predefined rules.

Findings

01 Hodoscope identified a new vulnerability in the Commit0 benchmark.

02 It recovered known exploits on ImpossibleBench and SWE-bench.

03 It reduces human review effort by a factor of 6 to 23.

Abstract

Existing approaches to monitoring AI agents rely on supervised evaluation: human-written rules or LLM-based judges that check for known failure modes. However, novel misbehaviors may fall outside predefined categories entirely, and LLM-based judges can be unreliable. To address this, we formulate unsupervised monitoring, drawing an analogy to unsupervised learning. Rather than checking for specific misbehaviors, an unsupervised monitor assists humans in discovering problematic agent behaviors without prior assumptions about what counts as problematic, leaving that determination to the human. We observe that problematic behaviors are often distinctive: a model exploiting a benchmark loophole exhibits actions absent from well-behaved baselines, and a vulnerability unique to one evaluation manifests as behavioral anomalies when the same model runs across multiple benchmarks. This…
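
The observation that problematic behaviors are distinctive relative to a baseline group can be illustrated with a toy sketch. This is not the authors' implementation: the text above does not say how behaviors are represented or compared, so the sketch assumes each agent action has already been reduced to a hypothetical categorical label (such as "edit_tests"), and it swaps in a plain smoothed log-ratio of label frequencies as the comparison. The function name, the labels, and the smoothing parameter are all illustrative assumptions.

```python
import math
from collections import Counter


def rank_distinctive_actions(target, baseline, smoothing=1.0):
    """Rank action labels by how over-represented they are in `target`.

    Assumption: actions are hypothetical categorical labels. The score is a
    smoothed log-ratio of label frequencies, a stand-in comparison chosen
    for illustration only.
    """
    t_counts, b_counts = Counter(target), Counter(baseline)
    vocab = sorted(set(t_counts) | set(b_counts))
    # Additive smoothing keeps the ratio finite for labels that are
    # entirely absent from one of the two groups.
    t_total = len(target) + smoothing * len(vocab)
    b_total = len(baseline) + smoothing * len(vocab)
    scores = []
    for action in vocab:
        p_target = (t_counts[action] + smoothing) / t_total
        p_baseline = (b_counts[action] + smoothing) / b_total
        scores.append((action, math.log(p_target / p_baseline)))
    # Highest score first: the labels a human reviewer would inspect first.
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Made-up data: the target group does everything the baseline does,
# plus a handful of test-file edits that never occur in the baseline.
baseline = ["read_file", "run_tests", "edit_source"] * 10
target = baseline + ["edit_tests"] * 5
ranked = rank_distinctive_actions(target, baseline)
```

With this made-up data, the label that appears only in the target group is ranked first and every shared label scores below zero. The sketch only orders labels for inspection; deciding whether the top-ranked behavior is actually problematic is left to the human reviewer, as the abstract describes.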
