The Hydra Effect: Emergent Self-repair in Language Model Computations

Thomas McGrath; Matthew Rahtz; Janos Kramar; Vladimir Mikulik; Shane Legg

arXiv:2307.15771 · cs.LG · August 1, 2023 · 2 cites

TL;DR

This paper identifies emergent self-repair in language models: when one attention layer is ablated, another layer compensates, and late MLP layers act to downregulate the maximum-likelihood token. Both behaviours bear on model robustness and on circuit-level attribution.

Contribution

The paper introduces the Hydra effect, in which ablating one attention layer causes another layer to compensate, and identifies a counterbalancing role for late MLP layers. Both are established through causal analysis and ablation studies.

Findings

01. Ablating one layer typically affects only a small number of downstream layers.

02. Late MLP layers downregulate the maximum-likelihood token.

03. Both effects occur even in language models trained without any form of dropout.

Abstract

We investigate the internal structure of language model computations using causal analysis and demonstrate two motifs: (1) a form of adaptive computation where ablations of one attention layer of a language model cause another layer to compensate (which we term the Hydra effect) and (2) a counterbalancing function of late MLP layers that act to downregulate the maximum-likelihood token. Our ablation studies demonstrate that language model layers are typically relatively loosely coupled (ablations to one layer only affect a small number of downstream layers). Surprisingly, these effects occur even in language models trained without any form of dropout. We analyse these effects in the context of factual recall and consider their implications for circuit-level attribution in language models.
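
To make the idea of compensation under ablation concrete, here is a minimal toy sketch. It is not the paper's models or procedure: the two layer functions, the numbers, and the use of zero-ablation are all hypothetical choices made for illustration. It shows how a downstream layer that reads the residual stream can partly restore a logit after an upstream layer is ablated, so the total effect of the ablation is smaller than what the ablated layer wrote directly.

```python
# Toy illustration only: both "layers" and all constants are hypothetical,
# and zero-ablation is assumed for simplicity.

def layer_a(residual):
    # Hypothetical attention layer: writes a fixed amount to the target logit.
    return 2.0

def layer_b(residual):
    # Hypothetical downstream layer: writes more when the logit it reads is low.
    return 0.5 * max(0.0, 4.0 - residual)

def forward(ablate_a=False):
    """Run the two-layer toy model; return the final logit and each layer's output."""
    residual = 0.0
    a_out = 0.0 if ablate_a else layer_a(residual)
    residual += a_out
    b_out = layer_b(residual)
    residual += b_out
    return residual, a_out, b_out

clean_logit, clean_a, clean_b = forward()
ablated_logit, _, ablated_b = forward(ablate_a=True)

direct_effect = clean_a                     # what layer A wrote to the logit
total_effect = clean_logit - ablated_logit  # change in the final logit
compensation = ablated_b - clean_b          # extra output from layer B

print(direct_effect, total_effect, compensation)
```

In this toy, the gap between the direct and total effect equals the extra output of the downstream layer, which is the kind of discrepancy that can complicate circuit-level attribution.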

Taxonomy

Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Domain Adaptation and Few-Shot Learning

Methods: Hydra