A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring
Julian Schulz

TL;DR
This paper proposes a roadmap for constructing AI safety cases based on chain-of-thought (CoT) monitoring: a two-part safety case for reasoning models, together with a research agenda for keeping their reasoning monitorable.
Contribution
It introduces a novel two-part safety case structure built on CoT monitoring, and systematically examines both the threats to monitorability in reasoning models and the techniques for preserving it.
Findings
Analysis of threats to monitorability in reasoning models
Evaluation of techniques for maintaining CoT faithfulness
Proposal of prediction markets for assessing safety case viability
Abstract
As AI systems approach dangerous capability levels where inability safety cases become insufficient, we need alternative approaches to ensure safety. This paper presents a roadmap for constructing safety cases based on chain-of-thought (CoT) monitoring in reasoning models and outlines our research agenda. We argue that CoT monitoring might support both control and trustworthiness safety cases. We propose a two-part safety case: (1) establishing that models lack dangerous capabilities when operating without their CoT, and (2) ensuring that any dangerous capabilities enabled by a CoT are detectable by CoT monitoring. We systematically examine two threats to monitorability: neuralese and encoded reasoning, which we categorize into three forms (linguistic drift, steganography, and alien reasoning) and analyze their potential drivers. We evaluate existing and novel techniques for maintaining…
Taxonomy
Topics: Safety Systems Engineering in Autonomy · Adversarial Robustness in Machine Learning · Autonomous Vehicle Technology and Safety
