Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Tom Lieberum; Matthew Rahtz; János Kramár; Neel Nanda; Geoffrey Irving; Rohin Shah; Vladimir Mikulik

arXiv:2307.09458·cs.LG·July 25, 2023·6 cites

Open Access

TL;DR

This paper tests whether circuit analysis techniques scale to a large language model, the 70B Chinchilla, through a case study of multiple-choice question answering. Logit attribution, attention pattern visualization, and activation patching carry over to this scale and identify a small set of 'output nodes'.

Contribution

It extends circuit analysis methods to a large-scale model, showing their scalability and providing insights into the model's handling of multiple-choice questions and answer label representations.

Findings

01. Circuit analysis techniques scale to Chinchilla 70B.

02. A small set of 'output nodes' (attention heads and MLPs) can be identified and categorized.

03. Query and key subspaces encode enumeration features.

Abstract

Circuit analysis is a promising technique for understanding the internal mechanisms of language models. However, existing analyses are done in small models far from the state of the art. To address this, we present a case study of circuit analysis in the 70B Chinchilla model, aiming to test the scalability of circuit analysis. In particular, we study multiple-choice question answering, and investigate Chinchilla's capability to identify the correct answer label given knowledge of the correct answer text. We find that the existing techniques of logit attribution, attention pattern visualization, and activation patching naturally scale to Chinchilla, allowing us to identify and categorize a small set of 'output nodes' (attention heads and MLPs). We further study the 'correct letter' category of attention heads aiming to understand the semantics of their features,…
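
As a rough illustration of what logit attribution means in this setting, the sketch below decomposes a logit difference between two answer labels into per-component contributions. It is a toy NumPy example, not the paper's code: the array shapes, the random data, and the function name `direct_logit_attribution` are assumptions made for illustration, and the final layer norm that a real transformer applies before unembedding is ignored.

```python
import numpy as np

def direct_logit_attribution(component_outputs, unembed, target, baseline):
    """Per-component contribution to logit[target] - logit[baseline].

    component_outputs: (n_components, d_model) array; each row is what one
        component (attention head or MLP) writes to the final residual stream.
    unembed: (d_model, vocab) unembedding matrix.
    Assumption: no final layer norm, so the logits are linear in the residual.
    """
    # Direction in residual space that raises `target` relative to `baseline`.
    direction = unembed[:, target] - unembed[:, baseline]
    return component_outputs @ direction

# Toy data (hypothetical sizes); the 4 vocab entries stand in for labels A-D.
rng = np.random.default_rng(0)
n_components, d_model, vocab = 6, 16, 4
outputs = rng.normal(size=(n_components, d_model))
W_U = rng.normal(size=(d_model, vocab))

attr = direct_logit_attribution(outputs, W_U, target=2, baseline=0)

# Sanity check: contributions sum to the full logit difference.
logits = outputs.sum(axis=0) @ W_U
total = logits[2] - logits[0]
print(np.allclose(attr.sum(), total))
```

Because the residual stream is a sum of component outputs, the contributions add up exactly to the total logit difference under the linearity assumption; components with the largest entries in `attr` are the candidates one would call output nodes.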

Taxonomy

Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Machine Learning in Materials Science

Methods: Chinchilla