Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

Eugene Koran; Yejun Yun; Samantha Tetef; Benjamin Arnav; Pablo Bernabeu-Pérez

arXiv:2605.15377·cs.AI·May 19, 2026

TL;DR

Ensembles of diverse monitors, particularly those that include fine-tuned monitors, detect misaligned AI actions better than individual monitors or homogeneous ensembles. Monitor diversity matters more than added scale.

Contribution

The paper introduces ensemble monitoring with 12 GPT-4.1-Mini monitors built using both prompting and fine-tuning strategies, and shows that monitor diversity improves detection of misaligned actions more than simply increasing scale.

Findings

01

Diverse ensembles outperform homogeneous ones in detecting misaligned actions.

02

Fine-tuned monitors maintain detection advantages on out-of-distribution attacks.

03

Ensemble diversity, not scale, is key to improved AI safety monitoring.

Abstract

As AI systems are increasingly deployed in autonomous agentic settings at scale, it is important to ensure the actions they take are safe and aligned with user intent. Monitoring agent actions is a key safety mechanism, yet reliable monitors remain difficult to build and the scale of these systems makes human oversight impractical. We show that combining signals from diverse monitors into an ensemble improves detection of misaligned actions. We build 12 GPT-4.1-Mini monitors using both prompting and fine-tuning strategies. We evaluate them on coding tasks where candidate solutions pass standard tests but fail on adversarial inputs. In this setting, diverse ensembles outperform both individual monitors and homogeneous ensembles. Our best 3-monitor ensemble achieves 2.4x greater detection performance gain compared to an ensemble composed of three identical monitors, with the same ensemble…
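The idea of combining monitor signals can be illustrated with a minimal sketch. The abstract does not state how the paper aggregates monitor outputs, so everything here is an assumption made for illustration: each monitor is taken to emit a suspicion score in [0, 1], the ensemble is taken to average those scores, and the names `ensemble_score` and `flag`, the threshold, and all score values are hypothetical.

```python
from statistics import mean


def ensemble_score(scores):
    """Average per-monitor suspicion scores.

    Assumption: each score lies in [0, 1] and the ensemble uses a plain
    mean. The paper's actual aggregation rule is not given in the abstract.
    """
    if not scores:
        raise ValueError("need at least one monitor score")
    return mean(scores)


def flag(scores, threshold=0.5):
    """Flag an action when the averaged score reaches the threshold.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    return ensemble_score(scores) >= threshold


# Made-up scores for one candidate solution that hides a failure.
# Three identical monitors repeat the same weak signal, so averaging
# adds nothing over a single monitor.
homogeneous = [0.3, 0.3, 0.3]

# Three different monitors (e.g. different prompts or a fine-tuned
# variant) may each pick up on different evidence.
diverse = [0.3, 0.9, 0.6]

print(flag(homogeneous))  # the repeated weak signal stays below threshold
print(flag(diverse))      # the combined signals cross the threshold
```

In this toy setting, averaging identical monitors cannot raise the score above what one monitor gives, while monitors that err in different ways can. That matches the intuition behind the abstract's comparison of diverse and homogeneous ensembles, but it does not reproduce the paper's experiments.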
