Overthinking the Truth: Understanding how Language Models Process False Demonstrations

Danny Halawi; Jean-Stanislas Denain; Jacob Steinhardt

arXiv:2307.09476 · cs.LG · March 13, 2024 · 6 cites

Open Access · 1 Repo

TL;DR

This paper investigates how language models process false demonstrations. It identifies two related phenomena, overthinking and false induction heads, that contribute to harmful imitation, and suggests that examining a model's intermediate layers can help mitigate it.

Contribution

It introduces the concepts of overthinking and false induction heads, providing mechanistic insights into how models reproduce false information during few-shot learning.

Findings

01

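Overthinking appears when predictions are decoded from intermediate layers: correct and incorrect demonstrations induce similar behavior at early layers, but behavior diverges sharply at a "critical layer", after which accuracy given incorrect demonstrations progressively decreases.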

02

False induction heads attend to and copy false information, contributing to overthinking.

03

Ablating false induction heads reduces overthinking, and with it harmful imitation.
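
The second and third findings can be illustrated with a toy sketch. It is not the authors' code: it assumes a simplified layer whose output is the plain sum of per-head output vectors, scores each head by the attention mass it puts on positions holding false labels (a hypothetical scoring rule chosen for illustration), and uses zero-ablation, which is one of several possible ablation choices. All names and numbers below are made up.

```python
from typing import List, Set

def false_attention_score(attn_row: List[float], false_positions: Set[int]) -> float:
    """Attention mass one head places on context positions holding false labels."""
    return sum(attn_row[p] for p in false_positions)

def ablate_heads(head_outputs: List[List[float]], ablated: Set[int]) -> List[float]:
    """Sum per-head output vectors, skipping (zero-ablating) the selected heads."""
    out = [0.0] * len(head_outputs[0])
    for idx, vec in enumerate(head_outputs):
        if idx in ablated:
            continue  # an ablated head contributes nothing to the layer output
        out = [o + v for o, v in zip(out, vec)]
    return out

# Toy attention patterns for 3 heads over 4 context positions (illustrative values).
attn = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.50, 0.05, 0.40],
    [0.25, 0.25, 0.25, 0.25],
]
false_positions = {1, 3}  # assumed positions of the false demonstration labels

scores = [false_attention_score(row, false_positions) for row in attn]
most_false = max(range(len(scores)), key=scores.__getitem__)

# Toy per-head output vectors (illustrative values).
head_outputs = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
ablated_out = ablate_heads(head_outputs, {most_false})
```

In this toy setup the head with the highest score is the one removed. In a real model, the effect of an ablation would be measured by re-running the forward pass and comparing accuracy given incorrect demonstrations before and after.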

Abstract

Modern language models can imitate complex patterns through few-shot learning, enabling them to complete challenging tasks without fine-tuning. However, imitation can also lead models to reproduce inaccuracies or harmful content if present in the context. We study harmful imitation through the lens of a model's internal representations, and identify two related phenomena: "overthinking" and "false induction heads". The first phenomenon, overthinking, appears when we decode predictions from intermediate layers, given correct vs. incorrect few-shot demonstrations. At early layers, both demonstrations induce similar model behavior, but the behavior diverges sharply at some "critical layer", after which the accuracy given incorrect demonstrations progressively decreases. The second phenomenon, false induction heads, are a possible mechanistic cause of overthinking: these are heads in late…
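
The layer-wise comparison described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each layer's hidden state can be projected directly through an unembedding matrix to get a label prediction, and it defines the "critical layer" as the first layer where the accuracy gap exceeds a hypothetical threshold `gap`. The toy hidden states and labels are invented for the example.

```python
from typing import List, Optional, Sequence

def argmax(xs: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(xs)), key=lambda i: xs[i])

def decode(hidden: List[float], unembed: List[List[float]]) -> int:
    """Project a hidden state through the unembedding matrix and pick the top label."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    return argmax(logits)

def layerwise_accuracy(hidden_by_layer, labels, unembed) -> List[float]:
    """hidden_by_layer[l][i] is the hidden state of example i at layer l."""
    accs = []
    for states in hidden_by_layer:
        preds = [decode(h, unembed) for h in states]
        accs.append(sum(p == y for p, y in zip(preds, labels)) / len(labels))
    return accs

def critical_layer(acc_correct, acc_incorrect, gap: float = 0.1) -> Optional[int]:
    """First layer where accuracy with correct demos exceeds accuracy with
    incorrect demos by more than `gap` (an assumed, illustrative definition)."""
    for layer, (a, b) in enumerate(zip(acc_correct, acc_incorrect)):
        if a - b > gap:
            return layer
    return None

# Toy setup: 2 labels, 2 examples, 3 layers, identity unembedding.
unembed = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
hidden_correct = [[[1.0, 0.0], [0.0, 1.0]]] * 3
hidden_incorrect = [
    [[1.0, 0.0], [0.0, 1.0]],  # early layer: same behavior as with correct demos
    [[1.0, 0.0], [1.0, 0.0]],  # behavior starts to diverge
    [[0.0, 1.0], [1.0, 0.0]],  # late layer: predictions follow the false labels
]

acc_correct = layerwise_accuracy(hidden_correct, labels, unembed)
acc_incorrect = layerwise_accuracy(hidden_incorrect, labels, unembed)
crit = critical_layer(acc_correct, acc_incorrect)
```

With these toy values the accuracy given incorrect demonstrations falls layer by layer while the accuracy given correct demonstrations stays flat, which mirrors the qualitative pattern the abstract describes.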

Code & Models

Repositories

dannyallover/overthinking_the_truth
Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning