Catching rationalization in the act: detecting motivated reasoning before and after CoT via activation probing
Parsa Mirtaheri, Mikhail Belkin

TL;DR
This paper investigates motivated reasoning in large language models, showing that internal activation probing can detect biased rationalizations before and after chain-of-thought generation more effectively than analyzing the generated explanations.
Contribution
It introduces activation probing methods to detect motivated reasoning in LLMs, demonstrating early detection and improved reliability over traditional CoT monitoring.
Findings
Pre-generation probes predict motivated reasoning as effectively as full CoT analysis.
Post-generation probes outperform CoT monitors in detecting motivated reasoning.
Internal activation signals can reveal biased reasoning even when CoT explanations do not.
Abstract
Large language models (LLMs) can produce chains of thought (CoT) that do not accurately reflect the actual factors driving their answers. In multiple-choice settings with an injected hint favoring a particular option, models may shift their final answer toward the hinted option and produce a CoT that rationalizes the response without acknowledging the hint, an instance of motivated reasoning. We study this phenomenon across multiple LLM families and datasets, demonstrating that motivated reasoning can be identified by probing internal activations even in cases where it cannot be easily determined from the CoT. Using supervised probes trained on the model's residual stream, we show that (i) pre-generation probes, applied before any CoT tokens are generated, predict motivated reasoning as well as an LLM-based CoT monitor that accesses the full CoT trace, and (ii) post-generation probes, applied…
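To make the idea of a supervised probe on residual-stream activations concrete, here is a minimal sketch. It is not the authors' method: the abstract does not specify the probe architecture, so a linear (logistic-regression) probe is assumed, and the activations are synthetic stand-ins rather than vectors extracted from a real model. The function names, dimensions, and hyperparameters are all hypothetical.

```python
import numpy as np


def make_synthetic_activations(n=400, d=32, seed=0):
    """Hypothetical stand-in for residual-stream activations.

    Each row plays the role of one prompt's activation vector. Label 1 marks
    an example flagged as motivated reasoning, label 0 one that is not.
    The two classes are shifted along a single random direction.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    # Gaussian noise plus a class-dependent shift of +/-2 along `direction`.
    acts = rng.normal(size=(n, d)) + 2.0 * np.outer(2 * labels - 1, direction)
    return acts, labels


def train_linear_probe(acts, labels, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels  # gradient of the log-loss w.r.t. the logits
        w -= lr * (acts.T @ grad / n + l2 * w)
        b -= lr * grad.mean()
    return w, b


def probe_predict(acts, w, b):
    """Return 1 where the probe predicts motivated reasoning, else 0."""
    return (acts @ w + b > 0).astype(int)


# Train on the first 300 examples and evaluate on the remaining 100.
acts, labels = make_synthetic_activations()
train_x, test_x = acts[:300], acts[300:]
train_y, test_y = labels[:300], labels[300:]

w, b = train_linear_probe(train_x, train_y)
accuracy = (probe_predict(test_x, w, b) == test_y).mean()
print(f"held-out probe accuracy: {accuracy:.2f}")
```

In a real setting, the synthetic array would be replaced by activations read from a chosen layer and token position: before any CoT tokens are generated for a pre-generation probe, or after the CoT for a post-generation probe. The labels would come from whether the hint actually changed the model's answer.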
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Neurobiology of Language and Bilingualism
