Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs

Bilal Chughtai; Alan Cooney; Neel Nanda

arXiv:2402.07321 · cs.LG · February 14, 2024 · 2 cites

Open Access · 3 Reviews

TL;DR

This paper investigates how large language models recall factual information, revealing that multiple independent mechanisms additively contribute to correct answers, and introduces methods to analyze these mechanisms and attention heads.

Contribution

It uncovers the additive mechanisms behind factual recall in LLMs and extends attribution techniques to better understand attention head contributions.

Findings

01

Factual recall involves multiple independent additive mechanisms.

02

Mechanisms interfere constructively to produce correct answers.

03

Extended attribution methods, which split an attention head's output by source token, reveal mixed heads whose output draws on different source tokens.
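The per-source-token attribution behind the third finding rests on a linear identity: a head's output at the final position is an attention-weighted sum over source positions, so its logit contribution splits exactly by source token. The NumPy sketch below illustrates that identity only. It is not the authors' code: all names, shapes, and random weights are hypothetical, and it assumes no biases and ignores the final LayerNorm.

```python
import numpy as np

rng = np.random.default_rng(0)
n_src, d_model, d_head, d_vocab = 4, 8, 3, 5   # toy sizes (assumed)

x = rng.normal(size=(n_src, d_model))      # residual stream at each source position
W_V = rng.normal(size=(d_model, d_head))   # value projection of one head
W_O = rng.normal(size=(d_head, d_model))   # output projection of that head
W_U = rng.normal(size=(d_model, d_vocab))  # unembedding matrix

# Attention pattern from the final position over the source positions.
scores = rng.normal(size=n_src)
attn = np.exp(scores) / np.exp(scores).sum()

# Contribution of each source token to the head's output, then to the logits.
per_source_out = attn[:, None] * (x @ W_V @ W_O)   # shape (n_src, d_model)
per_source_logits = per_source_out @ W_U           # shape (n_src, d_vocab)

# The head's full output and its direct logit attribution.
head_out = attn @ (x @ W_V) @ W_O                  # shape (d_model,)
head_logits = head_out @ W_U                       # shape (d_vocab,)

# The per-source pieces sum back to the head's total logit contribution.
assert np.allclose(per_source_logits.sum(axis=0), head_logits)
```

Reading `per_source_logits` row by row shows which source token drives the head's effect on each candidate answer, which is what makes a head that mixes several source tokens visible.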

Abstract

How do transformer-based large language models (LLMs) store and retrieve knowledge? We focus on the most basic form of this task, factual recall, where the model is tasked with explicitly surfacing stored facts in prompts of the form "Fact: The Colosseum is in the country of". We find that the mechanistic story behind factual recall is more complex than previously thought. It comprises several distinct, independent, and qualitatively different mechanisms that additively combine, constructively interfering on the correct attribute. We term this generic phenomenon the additive motif: models compute through summing up multiple independent contributions. Each mechanism's contribution may be insufficient alone, but summing results in constructive interference on the correct answer. In addition, we extend the method of direct logit attribution to attribute an attention head's output to individual…
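The additive motif can be illustrated with a toy calculation on the abstract's Colosseum prompt. The sketch below is hypothetical: the candidate tokens, the two mechanism names, and every logit value are made up for illustration and are not results from the paper. It assumes one mechanism that boosts attributes of the subject and another that boosts answers matching the relation.

```python
import numpy as np

# Hypothetical candidate answers for "Fact: The Colosseum is in the country of".
candidates = ["Italy", "France", "Rome"]

# Illustrative logit contributions (invented values, not measurements).
subject_logits = np.array([2.0, 0.0, 2.5])   # boosts things tied to the Colosseum
relation_logits = np.array([2.0, 2.5, 0.0])  # boosts things that are countries

# The additive motif: contributions are summed in logit space.
total = subject_logits + relation_logits

def top(logits):
    """Return the candidate with the highest logit."""
    return candidates[int(np.argmax(logits))]

print(top(subject_logits))   # Rome
print(top(relation_logits))  # France
print(top(total))            # Italy
```

In this toy setting neither contribution alone selects the correct attribute, but their sum does, because only "Italy" is boosted by both.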

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 2

Strengths

- The paper uses established mechanistic interpretation tools and extends them to identify mechanisms in the transformer that perform very specific purposes
- The SUBJECT-head, RELATION-head, and MLP additive behaviors are established by showing consistent patterns across a range of fact queries

Weaknesses

- The paper introduction and further discussions claim that the results reported here provide a mechanistic explanation for the limitations of LLMs to learn "B is A" from training on "A is B" [1]. However, I do not see sufficient evidence to support this claim
- They have shown that in the forward direction the transformer selectively promotes attributes relevant to the subject and the relation
- This does not show that the transformer CANNOT/DOES NOT perform the same operations in the r

Reviewer 02 · Rating 5 · marginally below the acceptance threshold · Confidence 4

Strengths

(1) Based on sufficient experimental results verification, the author has identified and explained the internal mechanisms of LLMs at the granularity level of attention heads and MLPs. More interestingly, it provides an explanation of the “reversal curse” phenomenon discovered in recent works.
(2) This work has thoroughly discussed the related work and proposed a range of possible directions for future works.

Weaknesses

(1) There have been many works [1, 2] interpreting the model behavior of Factual Recall. It seems that the novelty is insufficient with only a deeper zooming into attention heads using similar interpretability methods. Additionally, the discovery of the additive motif is not surprising enough, as already explained in work [3] that "Attention heads can be understood as independent operations, each outputting a result which is added into the residual stream."
(2) Is direct logit attribution (DLA)

Reviewer 03 · Rating 5 · marginally below the acceptance threshold · Confidence 2

Strengths

The study tackles an important/interesting problem and the paper reports a substantial amount of experimentation.

Weaknesses

Although I believe the idea is interesting, and there may be some valuable finding in the paper, I have difficulties seeing a clear take-home message based on the results presented, and probably also due to the way they are presented. I have some concrete points of criticism listed in the comments below (with approximate order of importance).
- The main claim, additivity of the multiple mechanisms, is not very clearly demonstrated in the paper. The separation of the subject/relation heads (

Taxonomy

Topics: Artificial Intelligence in Law · Law, Economics, and Judicial Systems

Methods: Focus