Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring
Joachim Schaeffer, Arjun Khandelwal, Tyler Tracy

TL;DR
This paper studies attack selection: an AI's ability to choose which attacks to submit so that they evade a monitor. In the concentrated BigCodeBench backdooring setting, prompting the attacker model to reason about the monitor substantially reduces measured safety.
Contribution
It frames attack selection as a classification problem, decomposed into mapping attacks to a quality score and mapping quality scores to submission probabilities, and shows that reasoning about the monitor can substantially lower safety in a concentrated AI control setting.
Findings
When attack selection is framed as classification, safety is significantly more sensitive to the false positive rate (FPR) than to the true positive rate (TPR).
Prompting the attacker model to reason about the monitor, while being cautious with attack selection, reduces safety from a 99% baseline to 59% at a 0.5% auditing budget.
Evaluations that do not elicit attack selection capabilities can overestimate safety.
Abstract
Future AI deployments will likely be monitored for malicious behaviour. The ability of these AIs to subvert monitors by adversarially selecting against them - attack selection - is particularly concerning. To study this, we let a red team create attack policies that attempt to insert attacks into code without being caught by a monitor in the concentrated BigCodeBench backdooring setting. We decompose attack selection into two problems: mapping attacks to a quality score and mapping quality scores to submission probabilities. We frame attack selection as a classification problem and show that safety is significantly more sensitive to FPR than TPR. We find that prompting the attacker model to reason about the monitor while being cautious with attack selection reduces safety from a baseline of 99% to 59% at 0.5% auditing budget, emphasizing that eliciting attack selection capabilities of…
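To make the classification framing concrete, here is a minimal toy model in Python. It is an illustrative sketch, not the paper's method or results. It assumes that each task offers one candidate attack, that a fraction `p_good` of candidates would evade the monitor, that the attack selector submits evading candidates with probability `tpr` and non-evading ones with probability `fpr`, and that the game ends at the first submitted attack (red wins if it evades, blue wins if it is caught). All names and parameter values are hypothetical.

```python
def toy_safety(tpr: float, fpr: float, p_good: float, n_tasks: int) -> float:
    """Probability that blue wins in a toy concentrated-control game.

    Assumptions (illustrative only, not taken from the paper):
      - each task has one candidate attack;
      - with probability p_good the candidate would evade the monitor;
      - the selector submits an evading candidate with probability tpr
        and a non-evading candidate with probability fpr;
      - the first submitted attack ends the game (evades -> red wins,
        caught -> blue wins); if nothing is submitted, blue wins.
    """
    # Probability that some attack is submitted on a given task.
    p_submit = tpr * p_good + fpr * (1.0 - p_good)
    if p_submit == 0.0:
        return 1.0  # the attacker never attacks, so no attack can succeed

    # Probability that at least one attack is submitted within n_tasks.
    p_any_attack = 1.0 - (1.0 - p_submit) ** n_tasks

    # Given that an attack is submitted, the chance that it evades.
    p_evade_given_submit = tpr * p_good / p_submit

    return 1.0 - p_any_attack * p_evade_given_submit


if __name__ == "__main__":
    base = toy_safety(tpr=0.5, fpr=0.05, p_good=0.5, n_tasks=1000)
    worse_fpr = toy_safety(tpr=0.5, fpr=0.06, p_good=0.5, n_tasks=1000)
    better_tpr = toy_safety(tpr=0.51, fpr=0.05, p_good=0.5, n_tasks=1000)
    print(f"baseline safety:       {base:.4f}")
    print(f"FPR + 0.01 changes it: {worse_fpr - base:+.4f}")
    print(f"TPR + 0.01 changes it: {better_tpr - base:+.4f}")
```

In this toy model, with many tasks and an FPR well below the TPR, the same absolute change moves safety much more when applied to FPR than to TPR. That matches the direction of the claim in the abstract, but only under the simplifying assumptions stated above.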
