Overthinking the Truth: Understanding how Language Models Process False Demonstrations
Danny Halawi, Jean-Stanislas Denain, Jacob Steinhardt

TL;DR
This paper studies how language models process false demonstrations in few-shot prompts. It identifies two phenomena, overthinking and false induction heads, that contribute to harmful imitation, and suggests that analyzing intermediate layers can help mitigate it.
Contribution
It introduces the concepts of overthinking and false induction heads, providing mechanistic insights into how models reproduce false information during few-shot learning.
Findings
Overthinking appears at a "critical layer": before it, correct and incorrect demonstrations induce similar behavior; after it, accuracy given incorrect demonstrations progressively decreases.
False induction heads attend to and copy false information, contributing to overthinking.
Ablating false induction heads reduces harmful imitation behaviors.
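The notion of a critical layer can be made concrete with a small sketch. It assumes per-layer accuracies have already been measured under correct and incorrect demonstrations. The function name `critical_layer`, the `min_gap` threshold, and the accuracy values are all hypothetical illustrations, not the authors' definition or results.

```python
def critical_layer(acc_correct, acc_incorrect, min_gap=0.1):
    """Return the first layer from which the accuracy gap stays >= min_gap.

    acc_correct / acc_incorrect: per-layer accuracies (floats in [0, 1])
    obtained with correct vs. incorrect few-shot demonstrations.
    Returns None if the two curves never diverge persistently.
    """
    if len(acc_correct) != len(acc_incorrect):
        raise ValueError("accuracy curves must cover the same layers")

    candidate = None
    for layer, (good, bad) in enumerate(zip(acc_correct, acc_incorrect)):
        if good - bad >= min_gap:
            # Remember only the start of the current run of large gaps.
            if candidate is None:
                candidate = layer
        else:
            # The gap closed again, so any earlier divergence was not persistent.
            candidate = None
    return candidate


# Toy values for illustration only (NOT taken from the paper).
toy_correct = [0.25, 0.40, 0.60, 0.70, 0.75, 0.80]
toy_incorrect = [0.25, 0.40, 0.58, 0.55, 0.45, 0.35]
toy_critical = critical_layer(toy_correct, toy_incorrect)
```

With these toy curves the gap first becomes large at layer index 3 and keeps growing, which mirrors the qualitative pattern described above. A threshold-based rule is only one possible way to locate the divergence point.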
Abstract
Modern language models can imitate complex patterns through few-shot learning, enabling them to complete challenging tasks without fine-tuning. However, imitation can also lead models to reproduce inaccuracies or harmful content if present in the context. We study harmful imitation through the lens of a model's internal representations, and identify two related phenomena: "overthinking" and "false induction heads". The first phenomenon, overthinking, appears when we decode predictions from intermediate layers, given correct vs. incorrect few-shot demonstrations. At early layers, both demonstrations induce similar model behavior, but the behavior diverges sharply at some "critical layer", after which the accuracy given incorrect demonstrations progressively decreases. The second phenomenon, false induction heads, are a possible mechanistic cause of overthinking: these are heads in late…
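Decoding predictions from intermediate layers can be sketched as follows. This is a minimal illustration in the style of what is often called the logit lens: project each layer's hidden state through the unembedding matrix and take the top token. Whether this matches the authors' exact decoding procedure is an assumption. The helper names, the toy vocabulary, and all numbers are hypothetical, and a real model would typically apply its final layer normalization before the projection.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def decode_layer(hidden, unembed, vocab):
    """Project one hidden state to vocabulary logits and return the top token."""
    logits = matvec(unembed, hidden)
    best = max(range(len(logits)), key=logits.__getitem__)
    return vocab[best]


def early_exit_predictions(hidden_by_layer, unembed, vocab):
    """Decode a prediction from every layer's hidden state."""
    return [decode_layer(h, unembed, vocab) for h in hidden_by_layer]


# Toy setup for illustration only (NOT real model activations).
toy_vocab = ["positive", "negative"]
toy_unembed = [[1.0, 0.0],   # logit row for "positive"
               [0.0, 1.0]]   # logit row for "negative"
toy_hidden = [[0.2, 0.1],    # early layer
              [0.9, 0.3],    # middle layer
              [0.4, 1.2]]    # final layer
toy_predictions = early_exit_predictions(toy_hidden, toy_unembed, toy_vocab)
```

In this toy example the early and middle layers decode to "positive" while the final layer flips to "negative". If "positive" were the true label and the demonstrations were mislabeled, stopping at the middle layer would give the more accurate answer.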
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
