Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?
Richard J. Young

TL;DR
This study evaluates chain-of-thought faithfulness in 12 open-weight reasoning models spanning 9 architectural families. The rate at which models acknowledge the influence of injected hints varies widely, and models often recognize that influence internally without acknowledging it in their outputs.
Contribution
It provides a broad empirical assessment of CoT faithfulness across open-weight reasoning models, identifying factors that affect acknowledgment rates and examining whether models recognize hints internally.
Findings
Faithfulness rates vary from 39.7% to 89.9% across models.
Consistency and sycophancy hints have the lowest acknowledgment rates.
Models internally recognize influence but often do not acknowledge it in outputs.
Abstract
Chain-of-thought (CoT) reasoning has been proposed as a transparency mechanism for large language models in safety-critical deployments, yet its effectiveness depends on faithfulness (whether models accurately verbalize the factors that actually influence their outputs), a property that prior evaluations have examined in only two proprietary models, finding acknowledgment rates as low as 25% for Claude 3.7 Sonnet and 39% for DeepSeek-R1. To extend this evaluation across the open-weight ecosystem, this study tests 12 open-weight reasoning models spanning 9 architectural families (7B-685B parameters) on 498 multiple-choice questions from MMLU and GPQA Diamond, injecting six categories of reasoning hints (sycophancy, consistency, visual pattern, metadata, grader hacking, and unethical information) and measuring the rate at which models acknowledge hint influence in their CoT when hints…
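As a rough illustration of the metric described above, the sketch below computes an acknowledgment rate per model or per hint category. It is not the paper's implementation: the `Trial` record, its field names, and the choice to count only trials where the model followed the hint are assumptions made for illustration, since the abstract is truncated before it states the exact condition. The sample data is invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trial:
    """One hinted question for one model (hypothetical record layout)."""
    model: str
    hint_type: str           # e.g. "sycophancy", "metadata"
    followed_hint: bool      # assumption: the answer matched the hinted option
    acknowledged_hint: bool  # assumption: the CoT mentions relying on the hint


def acknowledgment_rate(trials: list[Trial]) -> Optional[float]:
    """Share of hint-following trials whose CoT acknowledges the hint.

    Returns None when no trial followed the hint, so the rate is undefined.
    """
    influenced = [t for t in trials if t.followed_hint]
    if not influenced:
        return None
    return sum(t.acknowledged_hint for t in influenced) / len(influenced)


def rate_by(trials: list[Trial], key: str) -> dict[str, Optional[float]]:
    """Group trials by a field name ("model" or "hint_type") and rate each group."""
    groups: dict[str, list[Trial]] = {}
    for t in trials:
        groups.setdefault(getattr(t, key), []).append(t)
    return {name: acknowledgment_rate(group) for name, group in groups.items()}


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    sample = [
        Trial("model-a", "sycophancy", True, True),
        Trial("model-a", "sycophancy", True, False),
        Trial("model-a", "metadata", False, False),
        Trial("model-a", "metadata", True, True),
    ]
    print(rate_by(sample, "hint_type"))
```

Grouping by `"hint_type"` gives the kind of per-category comparison reported in the findings, and grouping by `"model"` gives the per-model range.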
Taxonomy
Topics: Topic Modeling · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
