Backchaining Loss of Control Mitigations from Mission-Specific Benchmarks in National Security
Matteo Pistillo, Samantha Faraone, Joshua Herman

TL;DR
This paper introduces an empirical methodology for mitigating Loss of Control in AI systems used in national security by backchaining from errors on specific benchmarks to identify and intervene on harmful affordances and permissions.
Contribution
It proposes a practical, step-by-step approach to identify and mitigate LoC risks using existing benchmarks, enabling deployers to act based on evidence they can generate.
Findings
The method enables targeted intervention on harmful affordances and permissions.
The approach is demonstrated with a security classification benchmark.
It supports proactive mitigation of LoC threats in high-stakes deployment contexts.
Abstract
Affordances and permissions are promising and timely safety levers for mitigating Loss of Control (LoC) threats in high-stakes deployment contexts, such as national security. Deployers in defense and intelligence could rely on several approaches to identify which affordances and permissions should be prioritized, such as structured threat modelling, pre-deployment agentic evaluations, post-deployment continuous monitoring, and AI safety cases. This paper proposes a complementary and empirical methodology that leverages existing use-case-specific benchmarks: backchaining LoC mitigations from the errors an AI system makes on national security benchmarks. The approach proceeds in three steps and allows national security deployers to start building LoC mitigations today, from evidence they can generate themselves. First, deployers evaluate AI systems on mission-specific benchmarks…
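The backchaining idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract is truncated after the first step, so everything past "evaluate on a benchmark and collect the errors" is an assumption made for illustration. The `BenchmarkResult` type, the classification labels, the affordance names, and the error-to-affordance mapping are all hypothetical; in practice a deployer would supply the mapping from their own analysis.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    """One benchmark item: the reference label and the system's answer (hypothetical type)."""
    item_id: str
    expected: str
    predicted: str


def collect_errors(results):
    """Step 1 (from the abstract): keep only the items the system got wrong."""
    return [r for r in results if r.expected != r.predicted]


def rank_affordances(results, error_to_affordances):
    """Assumed backchaining step: count how many observed errors implicate each affordance.

    `error_to_affordances` maps an (expected, predicted) error type to the
    affordances or permissions through which that error could cause harm.
    The mapping is a deployer-supplied assumption, not something the
    benchmark produces.
    """
    counts = Counter()
    for r in collect_errors(results):
        for affordance in error_to_affordances.get((r.expected, r.predicted), []):
            counts[affordance] += 1
    # Highest counts first: candidates to restrict or gate before deployment.
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical results from a security classification benchmark.
    results = [
        BenchmarkResult("doc-1", "SECRET", "UNCLASSIFIED"),
        BenchmarkResult("doc-2", "SECRET", "UNCLASSIFIED"),
        BenchmarkResult("doc-3", "UNCLASSIFIED", "SECRET"),
        BenchmarkResult("doc-4", "SECRET", "SECRET"),
    ]
    # Hypothetical mapping from error types to implicated affordances.
    mapping = {
        ("SECRET", "UNCLASSIFIED"): ["external_email", "public_file_share"],
        ("UNCLASSIFIED", "SECRET"): ["restricted_store_write"],
    }
    for affordance, n_errors in rank_affordances(results, mapping):
        print(affordance, n_errors)
```

Counting implicated errors is only one possible prioritization rule; a deployer might instead weight each error type by severity before ranking.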
