Chain-of-Thought Unfaithfulness as Disguised Accuracy
Oliver Bentham, Nathan Stringham, Ana Marasović

TL;DR
This paper examines the reliability of a metric intended to measure how well Chain-of-Thought (CoT) generations reflect a model's internal reasoning. It finds that the normalized metric correlates strongly with accuracy, and so may not be a valid measure of faithfulness.
Contribution
The study replicates previous scaling experiments, normalizes the faithfulness metric, and questions its validity by showing its strong correlation with accuracy.
Findings
Normalized faithfulness drops for smaller models
Strong correlation ($R^2$=0.74) between normalized faithfulness and accuracy
Scaling trends are reproducible under specific conditions
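As a minimal sketch of what the reported $R^2$ expresses, the snippet below computes the coefficient of determination for a simple least-squares line between two per-model scores. The function name `r_squared` and the sample values are hypothetical and purely illustrative; they are not data from the paper, and the paper's exact fitting procedure may differ.

```python
from statistics import mean


def r_squared(xs, ys):
    """R^2 of a simple least-squares line y = a + b*x.

    For a single predictor this equals the squared Pearson correlation.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when either variable is constant")
    return (sxy ** 2) / (sxx * syy)


if __name__ == "__main__":
    # Made-up per-model scores, for illustration only (not from the paper).
    accuracy = [0.30, 0.45, 0.60, 0.80]
    normalized_faithfulness = [0.20, 0.35, 0.40, 0.70]
    print(round(r_squared(accuracy, normalized_faithfulness), 3))
```

A value near 1 would mean that accuracy alone explains most of the variation in the normalized metric across models, which is the concern the findings raise.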
Abstract
Understanding the extent to which Chain-of-Thought (CoT) generations align with a large language model's (LLM) internal computations is critical for deciding whether to trust an LLM's output. As a proxy for CoT faithfulness, Lanham et al. (2023) propose a metric that measures a model's dependence on its CoT for producing an answer. Within a single family of proprietary models, they find that LLMs exhibit a scaling-then-inverse-scaling relationship between model size and their measure of faithfulness, and that a 13 billion parameter model exhibits increased faithfulness compared to models ranging from 810 million to 175 billion parameters in size. We evaluate whether these results generalize as a property of all LLMs. We replicate the experimental setup in their section focused on scaling experiments with three different families of models and, under specific conditions, successfully…
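One simple way to operationalize "dependence on its CoT" is sketched below, assuming we already have each question's final answer produced with and without a CoT. The function name `cot_dependence` and the inputs are hypothetical; the sketch omits the normalization step discussed in the paper and should not be read as the authors' implementation.

```python
def cot_dependence(answers_with_cot, answers_without_cot):
    """Fraction of questions whose final answer changes when the CoT is used.

    Assumption: a higher value is read as the model relying more on its CoT.
    """
    if len(answers_with_cot) != len(answers_without_cot):
        raise ValueError("answer lists must have the same length")
    if not answers_with_cot:
        raise ValueError("answer lists must be non-empty")
    changed = sum(
        with_cot != without_cot
        for with_cot, without_cot in zip(answers_with_cot, answers_without_cot)
    )
    return changed / len(answers_with_cot)


if __name__ == "__main__":
    # Hypothetical multiple-choice answers for four questions.
    with_cot = ["A", "B", "C", "D"]
    without_cot = ["A", "C", "C", "A"]
    print(cot_dependence(with_cot, without_cot))
```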
Taxonomy
Topics: Psychology of Moral and Emotional Judgment · Epistemology, Ethics, and Metaphysics
Methods: ALIGN
