Understanding Factual Recall in Transformers via Associative Memories

Eshaan Nichani; Jason D. Lee; Alberto Bietti

arXiv:2412.06538·cs.LG·December 10, 2024

Open Access · 3 Reviews

TL;DR

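This paper shows that shallow transformers can reach near-optimal factual recall capacity by combining associative memories. It proves that a transformer with one self-attention layer followed by an MLP attains 100% accuracy on a synthetic factual recall task whenever the number of self-attention parameters or MLP parameters scales linearly (up to log factors) with the number of facts.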

Contribution

The work proves that the storage capacities of both linear and MLP associative memories scale linearly with parameter count, and it introduces a synthetic task for analyzing factual recall in transformers.

Findings

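01 · The storage capacities of linear and MLP associative memories scale linearly with parameter count.

02 · A transformer with a single self-attention layer followed by an MLP can reach 100% accuracy on a synthetic factual recall task when its self-attention or MLP parameter count scales linearly (up to log factors) with the number of facts.

03 · Gradient flow analysis shows sequential learning behavior, including an intermediate "hallucination" phase.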

Abstract

Large language models have demonstrated an impressive ability to perform factual recall. Prior work has found that transformers trained on factual recall tasks can store information at a rate proportional to their parameter count. In our work, we show that shallow transformers can use a combination of associative memories to obtain such near-optimal storage capacity. We begin by proving that the storage capacities of both linear and MLP associative memories scale linearly with parameter count. We next introduce a synthetic factual recall task, and prove that a transformer with a single layer of self-attention followed by an MLP can obtain 100% accuracy on the task whenever either the total number of self-attention parameters or MLP parameters scales (up to log factors) linearly with the number of facts. In particular, the transformer can trade off between using the value matrices or the…
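To make the idea of a linear associative memory concrete, here is a minimal sketch rather than the paper's own construction. It assumes random unit-norm input and output embeddings, a one-to-one fact map `f`, a weight matrix built as a plain sum of outer products, and argmax decoding; the function name `recall_accuracy` and all parameter values are hypothetical. The matrix has d × d parameters, and recall stays accurate only while the number of stored facts is small relative to that count.

```python
import numpy as np


def recall_accuracy(d, n_facts, seed=0):
    """Fraction of facts recovered by a d x d outer-product associative memory.

    Assumptions (illustrative, not taken from the paper): random unit-norm
    embeddings, a bijective fact map f, and argmax decoding.
    """
    rng = np.random.default_rng(seed)

    # Row i of E is the input embedding e_i; row j of U is the output embedding u_j.
    E = rng.standard_normal((n_facts, d))
    E /= np.linalg.norm(E, axis=1, keepdims=True)
    U = rng.standard_normal((n_facts, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)

    # Fact map: input i is associated with output f[i].
    f = rng.permutation(n_facts)

    # W = sum_i u_{f(i)} e_i^T, a d x d matrix (d^2 parameters).
    W = U[f].T @ E

    # scores[i, j] = u_j^T W e_i; decode each input by its highest-scoring output.
    scores = E @ W.T @ U.T
    predictions = scores.argmax(axis=1)
    return float(np.mean(predictions == f))


if __name__ == "__main__":
    d = 32  # the memory has d * d = 1024 parameters
    for n_facts in (16, 64, 256, 1024, 2048):
        acc = recall_accuracy(d, n_facts)
        print(f"d={d}  facts={n_facts:5d}  recall accuracy={acc:.3f}")
```

Running the script prints recall accuracy as the number of facts grows with `d` fixed. Accuracy should fall once the fact count becomes comparable to the d² parameters, because cross-talk between stored pairs swamps the signal. This naive construction only illustrates the parameter-versus-capacity trade-off and is not expected to match the constants or log factors of the paper's theorems.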

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 8 · Confidence 4

Strengths

(S1) **Possibly high impact**. If the analyses performed in this paper are shown to hold for deeper transformers (with shared or unshared weights) using softmax activations in the attention and bigger datasets, the paper has the possibility to be high impact -- it would solidify the benefit of viewing transformers as formal Associative Memories, uniting the LLM community and the more niche physics- and math- communities studying associative memories.

(S2) **Good, formal definition of Hallucina

Weaknesses

(W1) **Unclear application to real Transformers and datasets**. (dual to (S1)) The simplified Transformer architecture studied theoretically and empirically in this work is smaller than real Transformers, uses only a single update step for much of the analysis, only uses a synthetic dataset to validate the theory, and drops the softmax from the attention. It is unclear how the theory would generalize to unshared weights and real datasets where the task is to reproduce sequences of data.

(W2) *

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1. The theoretical contributions are solid

Weaknesses

1. The theoretical toy model is rather simple (MLP and one-layer transformer). It is unclear how it generalizes to multi-layer transformers.
2. If I understand correctly, the synthetic setting relies on the fact that noise tokens and subject tokens are disjoint. In reality, usually the last token (not eos in practice) should be somewhat relevant for token selection (for example, “in” as the last token would look for locations). It doesn’t make too much sense to let the model do next token prediction

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- This work offers a theoretical understanding of how shallow Transformers handle factual recall, specifically revealing linear and MLP associative memory capacity in managing storage.
- The gradient flow analysis, particularly the identification of an intermediate "hallucination" phase, adds depth to the understanding of transformer training dynamics.
- The paper demonstrates mathematical rigor in its proofs regarding associative memory storage capacity

Weaknesses

- The findings may be somewhat narrow, given the reliance on synthetic tasks. While the theoretical insights are valuable, it remains uncertain how well these translate to complex, real-world scenarios involving non-random and interdependent factual associations.
- The study is centered on shallow models, raising questions about the applicability of these findings to deeper, large-scale transformers commonly used in research and industry.
- Certain sections, particularly those describing empiric

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques · Topic Modeling

Methods: Softmax · Attention Is All You Need