Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute
Eugene Koran, Yejun Yun, Samantha Tetef, Benjamin Arnav, Pablo Bernabeu-Pérez

TL;DR
This paper shows that ensembles of diverse monitors, particularly ensembles that include fine-tuned monitors, detect misaligned AI actions significantly better than individual monitors or homogeneous ensembles. Monitor diversity matters more than scale.
Contribution
The paper introduces ensemble monitoring with diverse GPT-4.1-Mini monitors and shows that diversity among monitors improves detection of unsafe actions more than simply increasing scale.
Findings
Diverse ensembles outperform homogeneous ones in detecting misaligned actions.
Fine-tuned monitors maintain detection advantages on out-of-distribution attacks.
Ensemble diversity, not scale, is key to improved AI safety monitoring.
Abstract
As AI systems are increasingly deployed in autonomous agentic settings at scale, it is important to ensure the actions they take are safe and aligned with user intent. Monitoring agent actions is a key safety mechanism, yet reliable monitors remain difficult to build and the scale of these systems makes human oversight impractical. We show that combining signals from diverse monitors into an ensemble improves detection of misaligned actions. We build 12 GPT-4.1-Mini monitors using both prompting and fine-tuning strategies. We evaluate them on coding tasks where candidate solutions pass standard tests but fail on adversarial inputs. In this setting, diverse ensembles outperform both individual monitors and homogeneous ensembles. Our best 3-monitor ensemble achieves 2.4x greater detection performance gain compared to an ensemble composed of three identical monitors, with the same ensemble…
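The idea of combining signals from several monitors can be illustrated with a minimal Python sketch. It is not the authors' implementation: the aggregation rule (a plain mean), the score range [0, 1], the 0.5 threshold, and the names `ensemble_score` and `flag` are all assumptions made for illustration, and the scores are made-up numbers rather than results from the paper.

```python
from statistics import mean


def ensemble_score(monitor_scores):
    """Average per-monitor suspicion scores (assumed to lie in [0, 1]).

    Averaging is an assumed aggregation rule, chosen only for illustration.
    """
    if not monitor_scores:
        raise ValueError("need at least one monitor score")
    return mean(monitor_scores)


def flag(monitor_scores, threshold=0.5):
    """Flag an action when the averaged suspicion reaches the threshold.

    The default threshold of 0.5 is hypothetical.
    """
    return ensemble_score(monitor_scores) >= threshold


# Three copies of one deterministic monitor return the same score,
# so averaging them adds no new information.
homogeneous = [0.3, 0.3, 0.3]

# Three different monitors can disagree; here one of them (hypothetically
# a fine-tuned one) is much more suspicious of the same action.
diverse = [0.3, 0.9, 0.6]

print(flag(homogeneous))
print(flag(diverse))
```

In this toy case the homogeneous ensemble reproduces the single monitor's miss, while the diverse ensemble crosses the threshold because one monitor contributes a signal the others lack. Real monitors are stochastic, so repeated sampling of one monitor is not entirely redundant; the sketch only shows why uncorrelated signals can help more than repetition.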
