Ulterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models
Sharan Ramjee

TL;DR
This paper introduces MoralChain, a benchmark for detecting misaligned reasoning in continuous thought models, revealing that misalignment can be identified in early latent tokens and is geometrically distinct in latent space.
Contribution
It presents a novel benchmark and a dual-trigger paradigm to study and detect misaligned reasoning in continuous thought models.
Findings
Misaligned latent reasoning can occur without affecting output alignment.
Linear probes can transfer to detect armed-but-benign states with high accuracy.
Misalignment is encoded in early latent tokens, which suggests that safety monitoring should focus there.
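To make the idea of a linear probe on latent states concrete, here is a minimal sketch. It does not reproduce the paper's setup: the latent vectors are synthetic Gaussian samples, the labels are hypothetical, and the probe is a simple mean-difference classifier chosen because it needs only the standard library. All function names (`make_latents`, `fit_mean_difference_probe`) are illustrative assumptions.

```python
import random


def make_latents(n, dim, shift, seed):
    """Build synthetic stand-ins for latent thought vectors.

    ASSUMPTION: real latents would come from a continuous thought model;
    here each vector is Gaussian noise with a class-dependent shift on axis 0.
    """
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2  # 1 = misaligned reasoning, 0 = benign (hypothetical labels)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += shift if label == 1 else -shift
        data.append((vec, label))
    return data


def fit_mean_difference_probe(data):
    """Fit a linear probe whose direction is the difference of class means."""
    dim = len(data[0][0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for vec, label in data:
        counts[label] += 1
        for j, x in enumerate(vec):
            sums[label][j] += x
    mu0 = [s / counts[0] for s in sums[0]]
    mu1 = [s / counts[1] for s in sums[1]]
    w = [a - b for a, b in zip(mu1, mu0)]
    # Place the decision boundary at the midpoint between the two class means.
    midpoint = [(a + b) / 2.0 for a, b in zip(mu1, mu0)]
    bias = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, bias


def predict(w, bias, vec):
    """Return 1 if the probe scores the vector as misaligned, else 0."""
    score = sum(wi * xi for wi, xi in zip(w, vec)) + bias
    return 1 if score > 0 else 0


def accuracy(w, bias, data):
    """Fraction of vectors whose predicted label matches the true label."""
    correct = sum(predict(w, bias, vec) == label for vec, label in data)
    return correct / len(data)


# Train on one synthetic sample and evaluate on a separately seeded one.
train = make_latents(n=400, dim=16, shift=2.0, seed=0)
test = make_latents(n=200, dim=16, shift=2.0, seed=1)
w, bias = fit_mean_difference_probe(train)
test_accuracy = accuracy(w, bias, test)
print(f"held-out probe accuracy: {test_accuracy:.3f}")
```

In a real study the probe would be fit on latent tokens extracted at a fixed position (for example, an early one) and evaluated on a different condition to test transfer; a logistic-regression probe would be a common alternative to the mean-difference direction used here.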
Abstract
Chain-of-Thought (CoT) reasoning has emerged as a key technique for eliciting complex reasoning in Large Language Models (LLMs). Although interpretable, its dependence on natural language limits the model's expressive bandwidth. Continuous thought models address this bottleneck by reasoning in latent space rather than in human-readable tokens. While they enable richer representations and faster inference, they raise a critical safety question: how can we detect misaligned reasoning in an uninterpretable latent space? To study this, we introduce MoralChain, a benchmark of 12,000 social scenarios with parallel moral/immoral reasoning paths. We train a continuous thought model with backdoor behavior using a novel dual-trigger paradigm: one trigger that arms misaligned latent reasoning ([T]) and another that releases harmful outputs ([O]). We demonstrate three findings: (1) continuous thought…
