Annotating the Chain-of-Thought: A Behavior-Labeled Dataset for AI Safety

Antonio-Gabriel Chacón Menke; Phan Xuan Tan; Eiji Kamioka

arXiv:2510.18154·cs.AI·October 22, 2025

TL;DR

This paper introduces a sentence-level labeled dataset for AI safety that enables detection and steering of safety behaviors in language model reasoning, improving safety monitoring at the activation level.

Contribution

It provides a novel dataset with sentence-level safety annotations for reasoning sequences, facilitating activation-based safety detection and intervention in LLMs.

Findings

01. Dataset enables detection of safety behaviors within reasoning chains.
02. Activation-based techniques can steer safety behaviors effectively.
03. Demonstrates improved safety oversight through activation monitoring.

Abstract

Recent work has highlighted the importance of monitoring chain-of-thought reasoning for AI safety; however, current approaches that analyze textual reasoning steps can miss subtle harmful patterns and may be circumvented by models that hide unsafe reasoning. We present a sentence-level labeled dataset that enables activation-based monitoring of safety behaviors during LLM reasoning. Our dataset contains reasoning sequences with sentence-level annotations of safety behaviors such as expression of safety concerns or speculation on user intent, which we use to extract steering vectors for detecting and influencing these behaviors within model activations. The dataset fills a key gap in safety research: while existing datasets label reasoning holistically, effective application of steering vectors for safety monitoring could be improved by identifying precisely when specific behaviors occur…
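
To make the idea of extracting a steering vector from sentence-level labels concrete, here is a minimal sketch. It assumes a difference-of-means construction, which is one common way to build steering vectors; the abstract does not say which method the authors use. The activations are synthetic, and the label names (`safety_concern`, `other`) and function names are hypothetical, chosen only for illustration.

```python
import numpy as np


def steering_vector(acts, labels, behavior):
    """Unit-norm difference of means between sentences with and without a behavior.

    acts:   (n_sentences, hidden_dim) array, one activation per sentence (assumed).
    labels: one behavior label per sentence (hypothetical label names).
    """
    mask = np.array([lab == behavior for lab in labels])
    diff = acts[mask].mean(axis=0) - acts[~mask].mean(axis=0)
    return diff / np.linalg.norm(diff)


def detect(acts, vector):
    """Score each sentence by projecting its activation onto the vector."""
    return acts @ vector


def steer(acts, vector, alpha):
    """Shift activations along the vector; alpha sets strength and sign."""
    return acts + alpha * vector


# Synthetic stand-in for per-sentence activations: 50 sentences labeled with
# the behavior, 50 without. The behavior is planted along the first axis.
rng = np.random.default_rng(0)
hidden_dim = 16
planted = np.zeros(hidden_dim)
planted[0] = 1.0

labels = ["safety_concern"] * 50 + ["other"] * 50
acts = rng.normal(size=(100, hidden_dim))
acts[:50] += 3.0 * planted

vector = steering_vector(acts, labels, "safety_concern")
scores = detect(acts, vector)
steered_scores = detect(steer(acts, vector, alpha=2.0), vector)

print("mean score, labeled sentences:  ", scores[:50].mean())
print("mean score, unlabeled sentences:", scores[50:].mean())
```

Sentence-level labels matter in this sketch because the two means are taken over individual sentences. With only one holistic label per reasoning sequence, every sentence in a sequence would fall into the same group, whether or not it exhibits the behavior.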

Code & Models

Datasets

AISafety-Student/reasoning-safety-behaviours
dataset · 41 dl

Taxonomy

Topics: Safety Systems Engineering in Autonomy · Adversarial Robustness in Machine Learning · Occupational Health and Safety Research