Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, Vladimir Mikulik

TL;DR
This paper investigates whether circuit analysis techniques can scale to large language models like Chinchilla 70B, demonstrating their effectiveness in understanding multiple-choice question answering mechanisms and identifying key output nodes.
Contribution
It extends circuit analysis methods to a large-scale model, showing their scalability and providing insights into the model's handling of multiple-choice questions and answer label representations.
Findings
Circuit analysis techniques scale to Chinchilla 70B.
A small set of 'output nodes' (attention heads and MLPs) can be identified and categorized.
Query and key subspaces encode enumeration features.
Abstract
Circuit analysis is a promising technique for understanding the internal mechanisms of language models. However, existing analyses are done in small models far from the state of the art. To address this, we present a case study of circuit analysis in the 70B Chinchilla model, aiming to test the scalability of circuit analysis. In particular, we study multiple-choice question answering, and investigate Chinchilla's capability to identify the correct answer label given knowledge of the correct answer text. We find that the existing techniques of logit attribution, attention pattern visualization, and activation patching naturally scale to Chinchilla, allowing us to identify and categorize a small set of 'output nodes' (attention heads and MLPs). We further study the 'correct letter' category of attention heads aiming to understand the semantics of their features,…
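To make the logit attribution technique named in the abstract concrete, here is a minimal sketch of direct logit attribution on toy data. It is not the authors' code: the function name, the dictionary of node outputs, the tiny dimensions, and the random weights are all illustrative assumptions, and the sketch ignores the final normalization layer that a real transformer applies before unembedding.

```python
import numpy as np


def direct_logit_attribution(node_outputs, unembed, correct_token, wrong_tokens):
    """Attribute a logit difference to individual nodes (toy sketch).

    node_outputs: dict mapping a hypothetical node name (attention head or MLP)
        to the vector of shape (d_model,) it writes into the residual stream
        at the final token position.
    unembed: unembedding matrix of shape (d_model, vocab).
    Returns a dict mapping node name to its contribution to
        logit(correct) - mean(logit(wrong)).
    Assumption: no final normalization layer, so logits are linear in the
    residual stream and per-node contributions add up exactly.
    """
    # Direction in residual space that raises the correct label's logit
    # relative to the average of the incorrect labels' logits.
    direction = unembed[:, correct_token] - unembed[:, wrong_tokens].mean(axis=1)
    return {name: float(vec @ direction) for name, vec in node_outputs.items()}


# Toy setup: random weights stand in for a real model's activations.
rng = np.random.default_rng(0)
d_model, vocab = 16, 10
unembed = rng.normal(size=(d_model, vocab))
node_outputs = {
    "head_0": rng.normal(size=d_model),
    "head_1": rng.normal(size=d_model),
    "mlp_0": rng.normal(size=d_model),
}

# Pretend token 2 is the correct answer label and tokens 0, 1, 3 are the others.
attributions = direct_logit_attribution(
    node_outputs, unembed, correct_token=2, wrong_tokens=[0, 1, 3]
)

# Rank nodes by how strongly they push toward (or away from) the correct label.
for name, score in sorted(attributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {score:+.3f}")
```

Because the toy model is linear from the residual stream to the logits, the per-node scores sum to the full logit difference; nodes with the largest scores are the candidates one would call 'output nodes' in this simplified setting.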
Taxonomy
Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Machine Learning in Materials Science
Methods: Chinchilla
