Self-attention as an attractor network: transient memories without backpropagation

Francesco D'Amico; Matteo Negri

arXiv:2409.16112·cs.LG·September 25, 2024

TL;DR

This paper derives the self-attention layer of transformers from local energy terms and uses that view to build a recurrent attractor network that can be trained without backpropagation; its dynamics pass through transient states strongly correlated with both train and test examples.

Contribution

It introduces a framework that treats self-attention as an attractor network obtained from local energy terms resembling a pseudo-likelihood, which allows a recurrent model to be trained without backpropagation.

Findings

01. Self-attention can be derived from local energy terms.

02. The proposed model exhibits transient states that are strongly correlated with both train and test examples.

03. Training without backpropagation is feasible using this framework.

Abstract

Transformers are one of the most successful architectures of modern neural networks. At their core is the so-called attention mechanism, which has recently attracted interest from the physics community because in certain cases it can be written as the derivative of an energy function: while it is possible to write the cross-attention layer as a modern Hopfield network, the same is not possible for self-attention, which is used in the GPT architectures and other autoregressive models. In this work we show that it is possible to obtain the self-attention layer as the derivative of local energy terms, which resemble a pseudo-likelihood. We leverage the analogy with pseudo-likelihood to design a recurrent model that can be trained without backpropagation: the dynamics show transient states that are strongly correlated with both train and test examples. Overall we present a novel framework to interpret…
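As background for the abstract's remark that cross-attention can be written as a modern Hopfield network, the sketch below numerically checks the standard relation: one gradient step on the modern Hopfield energy equals a softmax attention readout over the stored patterns. This is not the paper's self-attention construction. The function names, the random patterns, and the choice of keys equal to values are assumptions made only for illustration.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def energy(xi, X, beta):
    # Modern Hopfield energy (up to constants):
    # E(xi) = -(1/beta) * logsumexp(beta * X @ xi) + 0.5 * |xi|^2
    s = beta * (X @ xi)
    m = s.max()
    lse = m + np.log(np.exp(s - m).sum())
    return -lse / beta + 0.5 * (xi @ xi)

def attention_readout(xi, X, beta):
    # Query xi attends over the stored patterns X (keys = values = X).
    return X.T @ softmax(beta * (X @ xi))

def numerical_grad(f, x, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))   # 5 hypothetical stored patterns of dimension 8
xi = rng.standard_normal(8)       # hypothetical query / network state
beta = 2.0                        # inverse temperature (illustrative value)

grad = numerical_grad(lambda v: energy(v, X, beta), xi)
step = xi - grad                  # unit-step gradient descent on E
readout = attention_readout(xi, X, beta)
max_gap = np.abs(step - readout).max()
print("max |(xi - grad E) - attention readout| =", max_gap)
```

The printed gap should be at the level of finite-difference error, which reflects that the gradient of the energy is `xi - X.T @ softmax(beta * X @ xi)`.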

Code & Models

Repositories

francill99/self_attention_attractor_network (Official)

Taxonomy

Topics: Neural dynamics and brain function · Neural Networks and Reservoir Computing · Neural Networks and Applications

Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Multi-Head Attention · Weight Decay · Linear Warmup With Cosine Annealing · Adam · Residual Connection · Byte Pair Encoding