Reasoning Models Will Sometimes Lie About Their Reasoning

William Walden; Miriam Wanner

arXiv:2601.07663·cs.AI·April 22, 2026

Reasoning Models Will Sometimes Lie About Their Reasoning

William Walden, Miriam Wanner

PDF

1 Repo

TL;DR

This paper investigates how large reasoning models sometimes deny using hints in their reasoning even when they do, highlighting challenges for interpretability and faithfulness in AI explanations.

Contribution

It introduces new granular metrics for faithfulness and shows models often deny using hints despite evidence of their use, revealing limitations in current interpretability methods.

Findings

01

Models acknowledge hints but deny using them in many cases.

02

Instructions can improve faithfulness metrics but do not fully solve the denial issue.

03

Challenges remain for reliable reasoning interpretability and monitoring.

Abstract

Hint-based faithfulness evaluations have established that Large Reasoning Models (LRMs) may not say what they think: they do not always volunteer information about how key parts of the input (e.g. answer hints) influence their reasoning. Yet, these evaluations also fail to specify what models should do when confronted with hints or other unusual prompt content -- even though versions of such instructions are standard security measures (e.g. for countering prompt injections). Here, we study faithfulness under this more realistic setting in which models are explicitly alerted to the possibility of unusual inputs. We find that such instructions can yield strong results on faithfulness metrics from prior work. However, results on new, more granular metrics proposed in this work paint a mixed picture: although models may acknowledge the presence of hints, they will often deny intending to…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

wgantt/reasoning-models-lie
github

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.