Reasoning Traces Shape Outputs but Models Won't Say So
Yijie Hao, Lingjie Chen, Ali Emami, Joyce Ho

TL;DR
This paper investigates whether large reasoning models honestly report their reasoning processes, finding they often refuse to disclose true influences and instead fabricate explanations, highlighting a gap between model behavior and reported reasoning.
Contribution
The study introduces Thought Injection, a method to test if models follow injected reasoning, revealing models' reluctance to disclose true influences and systematic fabrication of explanations.
Findings
Injected hints reliably alter model outputs
Models overwhelmingly refuse to disclose true reasoning influences
Fabricated explanations activate deception-related neural directions
Abstract
Can we trust the reasoning traces that large reasoning models (LRMs) produce? We investigate whether these traces faithfully reflect what drives model outputs, and whether models will honestly report their influence. We introduce Thought Injection, a method that injects synthetic reasoning snippets into a model's <think> trace, then measures whether the model follows the injected reasoning and acknowledges doing so. Across 45,000 samples from three LRMs, we find that injected hints reliably alter outputs, confirming that reasoning traces causally shape model behavior. However, when asked to explain their changed answers, models overwhelmingly refuse to disclose the influence: overall non-disclosure exceeds 90% for extreme hints across 30,000 follow-up samples. Instead of acknowledging the injected reasoning, models fabricate aligned-appearing but unrelated explanations. Activation…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsExplainable Artificial Intelligence (XAI) · Bayesian Modeling and Causal Inference · Topic Modeling
