Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization
Kerem Zaman, Shashank Srivastava

TL;DR
This paper challenges the claim that a chain-of-thought (CoT) explanation is unfaithful whenever it omits a hint that influenced the answer. It shows that many such CoTs are judged faithful by other metrics and argues that faithfulness evaluation should be broader.
Contribution
It introduces a new faithful@k metric, applies causal mediation analysis, and argues for a broader interpretability toolkit beyond hint-based evaluations.
Findings
Larger inference-time budgets greatly increase hint verbalization, reaching up to 90% in some settings.
Many CoTs flagged as unfaithful by the Biasing Features metric are judged faithful by other metrics, exceeding 50% in some models.
Hint omission alone does not prove unfaithfulness.
Abstract
Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric adopts a narrow notion of faithfulness and confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with instruct-tuned and reasoning models, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the…
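The Biasing Features rule described in the abstract (a CoT is flagged as unfaithful when a prompt-injected hint affected the prediction but the CoT omits it) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation. The `Sample` fields and the test for "affected the prediction" are assumptions. The abstract also does not define faithful@k, so `any_of_k` is only one plausible reading, borrowed from the pass@k estimator used in code generation: the chance that at least one of k sampled CoTs verbalizes the hint.

```python
from dataclasses import dataclass
from math import comb


@dataclass
class Sample:
    # All field names are hypothetical, chosen for illustration only.
    answer_without_hint: str   # model answer on the unmodified prompt
    answer_with_hint: str      # model answer when the hint is injected
    hint_answer: str           # the answer the hint points to
    cot_mentions_hint: bool    # whether the CoT verbalizes the hint


def hint_affected_prediction(s: Sample) -> bool:
    # Assumption: the hint "affected" the prediction if the model
    # switched to the hinted answer only once the hint was present.
    return (s.answer_without_hint != s.hint_answer
            and s.answer_with_hint == s.hint_answer)


def flagged_unfaithful(s: Sample) -> bool:
    # Biasing Features rule as stated in the abstract: the hint changed
    # the prediction, yet the CoT never mentions it.
    return hint_affected_prediction(s) and not s.cot_mentions_hint


def any_of_k(n: int, c: int, k: int) -> float:
    # pass@k-style estimate (an assumption, not the paper's definition):
    # probability that at least one of k CoTs drawn without replacement
    # from n samples verbalizes the hint, when c of the n samples do.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under this reading, a hint that is verbalized in only 1 of 4 sampled CoTs gives `any_of_k(4, 1, 1) == 0.25` but `any_of_k(4, 1, 2) == 0.5`, which illustrates how a larger sampling budget can raise measured verbalization without any change to the model.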
