Self-attention as an attractor network: transient memories without backpropagation
Francesco D'Amico, Matteo Negri

TL;DR
This paper interprets self-attention in transformers as an attractor network derived from local energy terms, enabling training without backpropagation and revealing transient memory states linked to data.
Contribution
It introduces a novel framework to view self-attention as an attractor network based on energy functions, allowing backpropagation-free training and new theoretical insights.
Findings
Self-attention can be derived from local energy terms.
The proposed model exhibits transient states correlated with data.
Training without backpropagation is feasible using this framework.
Abstract
Transformers are one of the most successful architectures of modern neural networks. At their core there is the so-called attention mechanism, which recently interested the physics community as it can be written as the derivative of an energy function in certain cases: while it is possible to write the cross-attention layer as a modern Hopfield network, the same is not possible for the self-attention, which is used in the GPT architectures and other autoregressive models. In this work we show that it is possible to obtain the self-attention layer as the derivative of local energy terms, which resemble a pseudo-likelihood. We leverage the analogy with pseudo-likelihood to design a recurrent model that can be trained without backpropagation: the dynamics shows transient states that are strongly correlated with both train and test examples. Overall we present a novel framework to interpret…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNeural dynamics and brain function · Neural Networks and Reservoir Computing · Neural Networks and Applications
MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · Attention Is All You Need · Linear Layer · Cosine Annealing · Multi-Head Attention · Weight Decay · Linear Warmup With Cosine Annealing · Adam · Residual Connection · Byte Pair Encoding
