Sparse Attention Decomposition Applied to Circuit Tracing

Gabriel Franco; Mark Crovella

arXiv:2410.00340·cs.LG·October 30, 2024

TL;DR

This paper introduces a method for analyzing attention in GPT-2 small by identifying features that are sparsely encoded in the singular vectors of attention head matrices, revealing communication pathways and redundant circuits in the Indirect Object Identification (IOI) task.

Contribution

It presents a novel approach using sparse attention decomposition to isolate and identify features used for communication among attention heads in GPT-2 small.

Findings

01. Sparse signals enable clear separation of communication features.
02. Identified redundant paths in GPT-2 attention circuits.
03. Enhanced understanding of attention head interactions during IOI.

Abstract

Many papers have shown that attention heads work in conjunction with each other to perform complex tasks. It's frequently assumed that communication between attention heads is via the addition of specific features to token residuals. In this work we seek to isolate and identify the features used to effect communication and coordination among attention heads in GPT-2 small. Our key leverage on the problem is to show that these features are very often sparsely coded in the singular vectors of attention head matrices. We characterize the dimensionality and occurrence of these signals across the attention heads in GPT-2 small when used for the Indirect Object Identification (IOI) task. The sparse encoding of signals, as provided by attention head singular vectors, allows for efficient separation of signals from the residual background and straightforward identification of communication…
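
The idea of a signal being sparsely coded in singular vectors can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes a single head whose pre-softmax score is the bilinear form x_dst^T (W_Q W_K^T) x_src, ignores biases, layer normalization, and scaling, and uses random weights rather than GPT-2 small's. The names `qk_svd`, `score_terms`, and `planted` are hypothetical. The score is split into one term per singular vector pair, and a feature planted along one pair shows up as a single dominant term.

```python
import numpy as np

def qk_svd(W_Q, W_K):
    """SVD of the combined query-key matrix W_Q @ W_K.T, truncated to its rank."""
    omega = W_Q @ W_K.T                      # (d_model, d_model), rank <= d_head
    U, S, Vt = np.linalg.svd(omega, full_matrices=False)
    r = W_Q.shape[1]                         # keep only the d_head non-trivial directions
    return U[:, :r], S[:r], Vt[:r]

def score_terms(x_dst, x_src, U, S, Vt):
    """Per-singular-vector contributions: term k = (x_dst . u_k) * s_k * (v_k . x_src)."""
    return (x_dst @ U) * S * (Vt @ x_src)

# Toy setup (assumption: random weights stand in for a real attention head).
rng = np.random.default_rng(0)
d_model, d_head = 64, 8
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
U, S, Vt = qk_svd(W_Q, W_K)

# Plant a feature along one singular vector pair, plus small background noise.
planted = 3
x_dst = 5.0 * U[:, planted] + 0.1 * rng.normal(size=d_model)
x_src = 5.0 * Vt[planted] + 0.1 * rng.normal(size=d_model)

terms = score_terms(x_dst, x_src, U, S, Vt)
full_score = x_dst @ W_Q @ W_K.T @ x_src     # score computed without the decomposition
top = int(np.argmax(np.abs(terms)))          # index of the dominant term

print("terms sum to full score:", np.allclose(terms.sum(), full_score))
print("dominant term index:", top)
```

In this toy case the terms sum back to the full score, so the decomposition loses nothing, while one term carries almost all of it. Whether real heads behave this way is the empirical question the paper examines.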

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

(1) The theoretical proofs in this paper are remarkable. (2) The figures and tables in the paper are visually appealing, which enhances readability to a certain extent. (3) The related work and literature survey are adequate and well organized.

Weaknesses

(1) My big concern is that the paper may be technically obsolete. More mainstream experiments are now being conducted in GPT-4o and GPT o1-based settings. I don't understand why the authors are still conducting experiments on GPT-2. The gap between GPT-2 and GPT-4o and GPT o1-based methods is huge, so I think the experiments and the motivation are very limited, and the techniques in the paper may not be valid for the GPT-4 and GPT o1-based settings. (2) The writing of the article is obscure. May

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Novelty: This paper addresses the previously challenging issue of identifying and interpreting the complex interactions between attention heads in Transformer models by proposing a novel SVD-based sparse decomposition method.
2. Interpretability: The paper uncovers the communication pathways between attention heads in Transformer models, enhancing researchers' understanding of the model's internal workings.

Weaknesses

1. Possible computational complexity: Computing the SVD and the sparse decomposition is usually expensive and requires substantial computing resources. What are the computational complexity and the computing resources consumed in this paper? Have the factors related to computational complexity and required computing resources been considered?
2. Lack of open source code: The authors should provide source code so that others can reproduce and verify the results.

Reviewer 03 · Rating 3 · Confidence 4

Strengths

The main contribution of this paper lies in introducing a more scalable approach for interpreting the information flow within Transformers. Specifically,
- The use of SVD in circuit tracing seems simple and effective.
- The paper identifies new functionally important components (e.g., attention head 2,8) and provides a detailed analysis of redundant pathways in the model.
- Overall, the paper is well-structured and written.

Weaknesses

The paper can be improved in several major aspects.
- The technical novelty seems limited. The idea of using dimensionality reduction (SVD in particular) to interpret and visualize models is not new.
- The study focuses on the attention layers. Do the MLP layers and layer normalization contribute to the change of causality relationships from layer to layer?
- The analysis is limited to a specific model GPT-2 small and a specific task (IOI). How do the findings generalize to other settings?
- M

Code & Models

Repositories

gaabrielfranco/sparse-attention-decomposition
pytorch · Official

Taxonomy

Topics: VLSI and Analog Circuit Testing · Advanced Malware Detection Techniques · Adversarial Robustness in Machine Learning

Methods: Attention Is All You Need · Linear Layer · Residual Connection · Cosine Annealing · Byte Pair Encoding · Softmax · Dropout · Attention Dropout · Dense Connections