A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring

Julian Schulz

arXiv:2510.19476 · cs.LG · October 23, 2025

TL;DR

This paper proposes a roadmap for building AI safety cases on chain-of-thought (CoT) monitoring of reasoning models. It pairs a two-part safety case with a research agenda for keeping any dangerous capabilities enabled by a CoT detectable.

Contribution

The paper introduces a two-part safety case structure built on CoT monitoring and systematically examines threats to monitorability in reasoning models, along with techniques for preserving it.

Findings

01 Analysis of threats to monitorability in reasoning models

02 Evaluation of techniques for maintaining CoT faithfulness

03 Proposal of prediction markets for assessing safety case viability

Abstract

As AI systems approach dangerous capability levels where inability safety cases become insufficient, we need alternative approaches to ensure safety. This paper presents a roadmap for constructing safety cases based on chain-of-thought (CoT) monitoring in reasoning models and outlines our research agenda. We argue that CoT monitoring might support both control and trustworthiness safety cases. We propose a two-part safety case: (1) establishing that models lack dangerous capabilities when operating without their CoT, and (2) ensuring that any dangerous capabilities enabled by a CoT are detectable by CoT monitoring. We systematically examine two threats to monitorability: neuralese and encoded reasoning, which we categorize into three forms (linguistic drift, steganography, and alien reasoning) and analyze their potential drivers. We evaluate existing and novel techniques for maintaining…

Taxonomy

Topics: Safety Systems Engineering in Autonomy · Adversarial Robustness in Machine Learning · Autonomous Vehicle Technology and Safety