Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models

Yik Siu Chan; Zheng-Xin Yong; Stephen H. Bach

arXiv:2507.12428·cs.CL·October 8, 2025

Open Access · 3 Reviews

TL;DR

This paper investigates whether reasoning traces in language models can be used to predict unsafe outputs early, enabling timely intervention, and finds that latent activations provide more reliable signals than text alone.

Contribution

It introduces a simple linear probe on model activations that outperforms text-based methods in predicting response safety and can be applied early in the reasoning process.
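
A linear probe of this kind can be sketched as logistic regression over fixed activation vectors. The example below is a minimal illustration, not the authors' implementation: the "activations" are synthetic Gaussian vectors, the label convention (1 = unsafe final response), the helper names (`train_linear_probe`, `predict`, `f1_score`), and all hyperparameters are assumptions made for the sketch, and the fitting procedure is plain full-batch gradient descent.

```python
import numpy as np


def train_linear_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit logistic-regression weights (w, b) with full-batch gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = np.clip(X @ w + b, -50.0, 50.0)  # clip to keep exp() stable
        p = 1.0 / (1.0 + np.exp(-logits))         # predicted P(unsafe)
        w -= lr * (X.T @ (p - y) / n + l2 * w)    # gradient of loss + L2 penalty
        b -= lr * np.mean(p - y)
    return w, b


def predict(X, w, b):
    """Return 1 (unsafe) where the probe's logit is positive, else 0 (safe)."""
    return (X @ w + b > 0).astype(int)


def f1_score(y_true, y_pred):
    """F1 for the positive (unsafe) class."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Synthetic stand-in for CoT activations (assumption): each row plays the role
# of one activation vector, and the two classes differ along a single hidden
# direction. Real activations would come from the monitored model instead.
rng = np.random.default_rng(0)
n, d = 400, 32
y = rng.integers(0, 2, size=n)                  # 1 = unsafe final response
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X = rng.normal(size=(n, d)) + 3.0 * np.outer(2 * y - 1, direction)

# Hold out the last 100 examples for evaluation.
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

w, b = train_linear_probe(X_train, y_train)
test_f1 = f1_score(y_test, predict(X_test, w, b))
print(f"held-out F1: {test_f1:.3f}")
```

Because the probe is a single weight vector and a bias, scoring a new activation costs one dot product, which is what makes this family of monitors cheap enough to run during generation.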

Findings

01. Latent activations outperform text in safety prediction.

02. Early signals of misalignment appear before reasoning completes.

03. Lightweight probes enable real-time safety monitoring.

Abstract

Reasoning language models improve performance on complex tasks by generating long chains of thought (CoTs), but this process can also increase harmful outputs in adversarial settings. In this work, we ask whether the long CoTs can be leveraged for predictive safety monitoring: do the reasoning traces provide early signals of final response alignment that could enable timely intervention? We evaluate a range of monitoring methods using either CoT text or activations, including highly capable large language models, fine-tuned classifiers, and humans. First, we find that a simple linear probe trained on CoT activations significantly outperforms all text-based baselines in predicting whether a final response is safe or unsafe, with an average absolute increase of 13 in F1 scores over the best-performing alternatives. CoT texts are often unfaithful and misleading, while model latents provide…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

When access to a model’s activations is available, the paper demonstrates an important takeaway: the activations hold enough information to predict eventual misalignment in long thinking or reasoning traces. This can help in setting up effective test-time safety guardrails.

Weaknesses

The analysis seems to have an unaccounted pathway for leakage of information, which influences the findings and takeaways. The *activations* at the final token position of the last layer for each partial CoT (Line 163) implicitly encode the **prompt** itself, in addition to the subsequent CoT. This leads to a few issues:
- This potentially explains the effectiveness of the linear probe: if the prompt itself is indicative of the final misalignment of the response, the CoT segment is not required

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- Monitoring the harmfulness of the LRMs' final responses based on the CoT procedure is interesting.
- Although simple, the authors compare multiple baselines and different settings of linear probing (e.g., future-trained and present-trained).

Weaknesses

- About CoT monitoring methods:
  - I wonder what the differences are between the fine-tuned BERT classifier and the fine-tuned harmfulness classifier, since both are conducting binary classification.
  - I'm not sure about your settings. What do you try to predict? For each CoT index, do you try to predict the harmfulness of the final response without altering the original reasoning procedure, or will you interrupt and generate an instant response at each CoT step, and then do the prediction (a

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- The paper is easy to read and well structured.
- This paper tackles a relevant problem that has gained significant attention recently.
- The authors employ three relevant datasets for their experimental procedure, which seems well executed overall and the analysis of results is well conducted.

Weaknesses

- There seems to be an important body of literature missed in this work's background. Real-time safety alignment prediction is not novel, and it's been well understood that simple linear discriminators can perform well for this task (see references below).
- Related work focuses on Reasoning and Chain-of-Thought literature, while ignoring a large bulk of related work on controlled text generation. For example, how does this work sufficiently differ from [1] and [2] for it to be considered a wort

Taxonomy

Topics: AI-based Problem Solving and Planning · Semantic Web and Ontologies