Attention Approximates Sparse Distributed Memory
Trenton Bricken, Cengiz Pehlevan

TL;DR
This paper shows that, under certain data conditions, Transformer Attention closely approximates Kanerva's Sparse Distributed Memory (SDM), a biologically plausible associative memory model, which offers a new account of why Attention works so well.
Contribution
It establishes a formal connection between Transformer Attention and Sparse Distributed Memory, providing new interpretations and understanding of Attention mechanisms.
Findings
Transformer Attention closely relates to SDM under specific data conditions.
Pre-trained GPT2 models satisfy these conditions, validating the theoretical link.
The Attention-SDM map provides new computational and biological interpretations of Attention.
Abstract
While Attention has come to be an important mechanism in deep learning, there remains limited intuition for why it works so well. Here, we show that Transformer Attention can be closely related under certain data conditions to Kanerva's Sparse Distributed Memory (SDM), a biologically plausible associative memory model. We confirm that these conditions are satisfied in pre-trained GPT2 Transformer models. We discuss the implications of the Attention-SDM map and provide new computational and biological interpretations of Attention.
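To make the claimed relationship concrete, here is a minimal NumPy sketch, not the authors' code. It assumes a simplified SDM read in which each stored pattern is weighted by the number of randomly placed neurons lying within a Hamming radius of both the query and that pattern's address, and it compares those weights with softmax weights over cosine similarity. The dimension `n`, neuron count `r`, `radius`, and `beta` are illustrative assumptions, as are all function names.

```python
import numpy as np


def sdm_read_weights(query, addresses, neurons, radius):
    """Weight each pattern by the number of neurons within `radius`
    (Hamming distance) of BOTH the query and that pattern's address."""
    near_query = (neurons != query).sum(axis=1) <= radius               # shape (r,)
    dists = (neurons[:, None, :] != addresses[None, :, :]).sum(axis=2)  # shape (r, m)
    near_pattern = dists <= radius
    counts = (near_pattern & near_query[:, None]).sum(axis=0).astype(float)
    return counts / counts.sum()


def softmax_weights(query, addresses, beta):
    """Softmax over beta * cosine similarity, treating bits as +/-1 values."""
    n = query.size
    hamming = (addresses != query).sum(axis=1)
    cosine = 1.0 - 2.0 * hamming / n  # cosine similarity of +/-1 vectors
    logits = beta * cosine
    w = np.exp(logits - logits.max())
    return w / w.sum()


def flip_first_bits(x, k):
    """Return a copy of binary vector x with its first k bits flipped."""
    y = x.copy()
    y[:k] ^= 1
    return y


rng = np.random.default_rng(0)
n, r, radius = 64, 20000, 24  # illustrative values, not taken from the paper

query = rng.integers(0, 2, size=n)
# Three stored pattern addresses at Hamming distance 4, 16, and 32 from the query.
addresses = np.stack([flip_first_bits(query, k) for k in (4, 16, 32)])
neurons = rng.integers(0, 2, size=(r, n))  # uniformly random neuron addresses

sdm_w = sdm_read_weights(query, addresses, neurons, radius)
attn_w = softmax_weights(query, addresses, beta=8.0)

print("SDM weights:    ", np.round(sdm_w, 3))
print("Softmax weights:", np.round(attn_w, 3))
```

In both schemes the weight on a pattern falls as its address moves away from the query. The sketch only illustrates this shared qualitative behavior; it does not fit `beta` to the SDM radius, so the two weight vectors should not be expected to match numerically.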
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Explainable Artificial Intelligence (XAI) · Neural dynamics and brain function
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Byte Pair Encoding
