Attention Approximates Sparse Distributed Memory

Trenton Bricken; Cengiz Pehlevan

arXiv:2111.05498 · cs.LG · January 19, 2022 · 6 cites

TL;DR

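This paper shows that, under certain data conditions, Transformer Attention closely approximates Kanerva's Sparse Distributed Memory (SDM), a biologically plausible associative memory model. The link offers new intuition for why Attention works well and how it might be interpreted biologically.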

Contribution

The paper establishes a formal connection between Transformer Attention and Sparse Distributed Memory, which yields new computational and biological interpretations of Attention.

Findings

01 Transformer Attention closely relates to SDM under specific data conditions.

02 Pre-trained GPT2 models satisfy these conditions, validating the theoretical link.

03 The Attention-SDM map provides new computational and biological interpretations of Attention.

Abstract

While Attention has come to be an important mechanism in deep learning, there remains limited intuition for why it works so well. Here, we show that Transformer Attention can be closely related under certain data conditions to Kanerva's Sparse Distributed Memory (SDM), a biologically plausible associative memory model. We confirm that these conditions are satisfied in pre-trained GPT2 Transformer models. We discuss the implications of the Attention-SDM map and provide new computational and biological interpretations of Attention.
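The relationship described in the abstract can be illustrated with a small numerical sketch. In SDM, a stored pattern's contribution to a read depends on how many addresses lie within the Hamming radius of both the query and the pattern (the circle intersection), and this count falls off roughly exponentially as the two vectors move apart, which is what a softmax over similarities also produces. The code below is only an illustration under stated assumptions: the function names, the address size `n = 64`, the radius `d = 11`, and the least-squares fit of the softmax inverse temperature `beta` are hypothetical choices made for this example, not the authors' implementation or settings.

```python
from math import comb, exp, log


def circle_intersection(n, d, dv):
    """Count n-bit addresses within Hamming distance d of BOTH of two
    vectors that are dv bits apart (exact combinatorial count)."""
    total = 0
    for i in range(dv + 1):          # flips among the dv bits where the vectors differ
        for j in range(n - dv + 1):  # flips among the n - dv bits where they agree
            # distance to the first vector is i + j, to the second (dv - i) + j
            if i + j <= d and (dv - i) + j <= d:
                total += comb(dv, i) * comb(n - dv, j)
    return total


def sdm_weights(n, d, distances):
    """Normalized circle-intersection counts for patterns at the given distances."""
    counts = [circle_intersection(n, d, dv) for dv in distances]
    total = sum(counts)
    return [c / total for c in counts]


def fit_beta(n, d, distances):
    """Least-squares slope of log(intersection) against cosine similarity.

    For +/-1 vectors, cosine similarity equals 1 - 2 * dv / n.
    Assumes every distance has a non-zero intersection (dv <= 2 * d).
    """
    xs = [1 - 2 * dv / n for dv in distances]
    ys = [log(circle_intersection(n, d, dv)) for dv in distances]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def softmax_weights(beta, n, distances):
    """Softmax attention weights over the same patterns, using cosine similarity."""
    logits = [beta * (1 - 2 * dv / n) for dv in distances]
    top = max(logits)  # subtract the max for numerical stability
    exps = [exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    n, d = 64, 11  # illustrative values only
    distances = list(range(0, 2 * d + 1, 2))
    beta = fit_beta(n, d, distances)
    sdm = sdm_weights(n, d, distances)
    attn = softmax_weights(beta, n, distances)
    print(f"fitted beta = {beta:.2f}")
    for dv, s, a in zip(distances, sdm, attn):
        print(f"dv={dv:2d}  sdm={s:.4e}  softmax={a:.4e}")
```

Running the script prints the two weight profiles side by side, so the quality of the exponential approximation can be inspected for any choice of `n` and `d`. Each pattern appears once per distance here, which is a simplification; real keys would be distributed unevenly across distances.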

Code & Models

Repositories

trentbrick/attention-approximates-sdm
JAX · Official

Videos

Attention Approximates Sparse Distributed Memory · SlidesLive

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Explainable Artificial Intelligence (XAI) · Neural dynamics and brain function

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Byte Pair Encoding