Combining Cost-Constrained Runtime Monitors for AI Safety
Tim Tian Hua, James Baskerville, Henri Lemoine, Mia Hopman, Aryan Bhatt, Tyler Tracy

TL;DR
This paper presents a method for optimally combining multiple runtime monitors for AI safety, maximizing detection recall under a cost constraint. In a code review scenario, it more than doubles recall compared to a naive baseline.
Contribution
The paper introduces an algorithm that strategically combines monitors and allocates interventions, improving detection recall under budget constraints.
Findings
Recall more than doubles compared to a naive baseline
Combining two monitors can Pareto dominate individual monitors
Framework effectively balances detection and cost in AI safety monitoring
Abstract
Monitoring AIs at runtime can help us detect and stop harmful actions. In this paper, we study how to efficiently combine multiple runtime monitors into a single monitoring protocol. The protocol's objective is to maximize the probability of applying a safety intervention on misaligned outputs (i.e., maximize recall). Since running monitors and applying safety interventions are costly, the protocol also needs to adhere to an average-case budget constraint. Taking the monitors' performance and cost as given, we develop an algorithm to find the best protocol. The algorithm exhaustively searches over when and which monitors to call, and allocates safety interventions based on the Neyman-Pearson lemma. By focusing on likelihood ratios and strategically trading off spending on monitors against spending on interventions, we more than double our recall rate compared to a naive baseline in a…
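The allocation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single monitor with discrete scores, score distributions that are known exactly, and a budget measured as average cost on aligned outputs. All names (`neyman_pearson_audit_rule`, `best_protocol`) and all numbers are hypothetical. The sketch audits score bins in descending order of likelihood ratio until the audit budget is spent, then compares that against skipping the monitor and spending the whole budget on random audits.

```python
from typing import List, Sequence, Tuple


def neyman_pearson_audit_rule(
    p_misaligned: Sequence[float],
    p_aligned: Sequence[float],
    max_false_positive_rate: float,
) -> Tuple[List[float], float]:
    """Audit score bins in order of likelihood ratio (hypothetical helper).

    p_misaligned[i] and p_aligned[i] are the probabilities that the monitor
    emits score i on misaligned and aligned outputs respectively.
    Returns the audit probability for each score and the resulting recall.
    """
    def likelihood_ratio(i: int) -> float:
        if p_aligned[i] == 0:
            return float("inf")  # auditing this bin costs no aligned budget
        return p_misaligned[i] / p_aligned[i]

    order = sorted(range(len(p_aligned)), key=likelihood_ratio, reverse=True)
    audit_prob = [0.0] * len(p_aligned)
    remaining = max_false_positive_rate
    for i in order:
        if p_aligned[i] <= remaining:
            audit_prob[i] = 1.0  # the whole bin fits in the budget
            remaining -= p_aligned[i]
        else:
            # Randomize on the boundary bin so the budget is used exactly.
            audit_prob[i] = remaining / p_aligned[i]
            break
    recall = sum(a * p for a, p in zip(audit_prob, p_misaligned))
    return audit_prob, recall


def best_protocol(
    p_misaligned: Sequence[float],
    p_aligned: Sequence[float],
    budget: float,
    monitor_cost: float,
    audit_cost: float,
) -> Tuple[str, float]:
    """Compare 'audit at random' with 'run the monitor, then audit'."""
    # Option A: never call the monitor; spend everything on random audits.
    random_recall = min(1.0, budget / audit_cost)
    if budget < monitor_cost:
        return ("audit_at_random", random_recall)
    # Option B: always pay for the monitor; the remainder buys audits.
    fpr = min(1.0, (budget - monitor_cost) / audit_cost)
    _, monitored_recall = neyman_pearson_audit_rule(p_misaligned, p_aligned, fpr)
    if monitored_recall > random_recall:
        return ("use_monitor", monitored_recall)
    return ("audit_at_random", random_recall)


# Hypothetical four-level monitor score; higher scores are more suspicious.
P_ALIGNED = [0.5, 0.25, 0.125, 0.125]
P_MISALIGNED = [0.125, 0.125, 0.25, 0.5]

choice, recall = best_protocol(
    P_MISALIGNED, P_ALIGNED, budget=1.0, monitor_cost=0.5, audit_cost=4.0
)
print(choice, recall)
```

With these made-up numbers, paying for the monitor leaves enough budget to audit only the most suspicious score, which still beats auditing a quarter of outputs at random. The abstract describes a broader search over when and which of several monitors to call; this sketch covers only the single-monitor case.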
