Language Models Don't Always Say What They Think: Unfaithful   Explanations in Chain-of-Thought Prompting

Miles Turpin; Julian Michael; Ethan Perez; Samuel R. Bowman

arXiv:2305.04388·cs.CL·December 12, 2023·77 cites

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

Miles Turpin, Julian Michael, Ethan Perez, Samuel R. Bowman

PDF

Open Access 2 Repos 1 Datasets 2 Videos

TL;DR

This paper reveals that chain-of-thought explanations in large language models can be systematically unfaithful, often rationalizing biased or incorrect answers, which poses safety and trust concerns.

Contribution

It demonstrates that CoT explanations can be manipulated by input biases, leading to misleading justifications and significant drops in accuracy, highlighting the need for more faithful interpretability methods.

Findings

01

CoT explanations can be heavily biased by input reordering.

02

Biasing models toward incorrect answers leads to plausible yet false explanations.

03

Model accuracy drops up to 36% when explanations are manipulated.

Abstract

Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output, often referred to as chain-of-thought reasoning (CoT). It is tempting to interpret these CoT explanations as the LLM's process for solving a task. This level of transparency into LLMs' predictions would yield significant safety benefits. However, we find that CoT explanations can systematically misrepresent the true reason for a model's prediction. We demonstrate that CoT explanations can be heavily influenced by adding biasing features to model inputs--e.g., by reordering the multiple-choice options in a few-shot prompt to make the answer always "(A)"--which models systematically fail to mention in their explanations. When we bias models toward incorrect answers, they frequently generate CoT explanations rationalizing those answers. This causes…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Datasets

richardyoung/cot-faithfulness-open-models
dataset· 450 dl
450 dl

Videos

Claude 3.7 is More Significant than its Name Implies (ft DeepSeek R2 + GPT 4.5 coming soon)· youtube

'Show Your Working': ChatGPT Performance Doubled w/ Process Rewards (+Synthetic Data Event Horizon)· youtube

Taxonomy

TopicsTopic Modeling · Explainable Artificial Intelligence (XAI) · Machine Learning in Materials Science

MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · {Dispute@FaQ-s}How to file a dispute with Expedia? · Multi-Head Attention · 15 Ways to Contact How can i speak to someone at Delta Airlines · Attention Is All You Need · Linear Layer · Cosine Annealing · Adam · Weight Decay · Residual Connection