A sketch of an AI control safety case
Tomek Korbak, Joshua Clymer, Benjamin Hilton, Buck Shlegeris, and Geoffrey Irving

TL;DR
This paper proposes a "control safety case": a structured argument AI developers can use to justify deploying LLM agents by showing that the models are incapable of subverting control measures. The approach is illustrated with a hypothetical case study.
Contribution
It introduces the concept of a control safety case and sketches its application to assess the safety of deploying potentially dangerous LLM agents.
Findings
A structured safety argument can be built around control measures and model evaluation.
Red team testing can inform the safety case by simulating potential model exploits.
Conservative extrapolation of model capabilities supports safety assurances.
Abstract
As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe. We sketch how developers could construct a "control safety case", which is a structured argument that models are incapable of subverting control measures in order to cause unacceptable outcomes. As a case study, we sketch an argument that a hypothetical LLM agent deployed internally at an AI company won't exfiltrate sensitive information. The sketch relies on evidence from a "control evaluation," where a red team deliberately designs models to exfiltrate data in a proxy for the deployment environment. The safety case then hinges on several claims: (1) the red team adequately elicits model capabilities to exfiltrate data, (2) control measures remain at least as effective in deployment, and (3) developers conservatively…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Occupational Health and Safety Research · Ethics and Social Impacts of AI
