Backchaining Loss of Control Mitigations from Mission-Specific Benchmarks in National Security

Matteo Pistillo; Samantha Faraone; Joshua Herman

arXiv:2605.21095 · cs.CY · May 21, 2026

TL;DR

This paper introduces an empirical methodology for mitigating Loss of Control (LoC) in AI systems used in national security: backchaining from the errors a system makes on mission-specific benchmarks to identify, and intervene on, harmful affordances and permissions.

Contribution

The paper proposes a practical, step-by-step approach to identifying and mitigating LoC risks using existing benchmarks, so that deployers can act on evidence they generate themselves.

Findings

01 The method enables targeted intervention on harmful affordances.

02 The approach is demonstrated with a security classification benchmark.

03 The approach supports proactive mitigation of LoC threats in high-stakes contexts.

Abstract

Affordances and permissions are promising and timely safety levers for mitigating Loss of Control (LoC) threats in high-stakes deployment contexts, such as national security. Deployers in defense and intelligence could rely on several approaches to identify which affordances and permissions should be prioritized, such as structured threat modelling, pre-deployment agentic evaluations, post-deployment continuous monitoring, and AI safety cases. This paper proposes a complementary and empirical methodology that leverages existing use-case-specific benchmarks: backchaining LoC mitigations from the errors an AI system makes on national security benchmarks. The approach proceeds in three steps and allows national security deployers to start building LoC mitigations today, from evidence they can generate themselves. First, deployers evaluate AI systems on mission-specific benchmarks…
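
The abstract is cut off before its second and third steps, so the sketch below does not try to reproduce them. It only illustrates the general idea stated above: start from the errors a system makes on a benchmark and work backwards to the affordances and permissions involved. Everything in it is an assumption made for illustration, not the authors' method: the `BenchmarkResult` record, the per-item `affordances` tags, the function name `rank_affordances_by_error`, and the choice to rank by raw error count.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    """One scored benchmark item (hypothetical record layout)."""
    item_id: str
    correct: bool
    # Hypothetical tags: affordances/permissions the system exercised on this item.
    affordances: tuple[str, ...]


def rank_affordances_by_error(results: list[BenchmarkResult]) -> list[tuple[str, int]]:
    """Count how often each affordance appears in failed items.

    Returns (affordance, error_count) pairs, highest count first,
    with ties broken alphabetically so the ordering is deterministic.
    """
    errors: Counter[str] = Counter()
    for result in results:
        if not result.correct:
            errors.update(result.affordances)
    return sorted(errors.items(), key=lambda pair: (-pair[1], pair[0]))
```

A deployer could use a ranking like this as a first-pass shortlist of affordances to review. A raw count ignores how often each affordance appears in correct items, so an error rate per affordance may be a better signal when the data supports it.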
