Linear Transformers Are Secretly Fast Weight Programmers
Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber

TL;DR
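This paper shows that linearised self-attention, as used in linear transformers, is formally equivalent to the fast weight programmers of the early '90s. It replaces the purely additive memory update with a delta rule-like update that uses dynamically computed learning rates, and its experiments show benefits on retrieval, translation, and language modeling tasks.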
Contribution
It establishes the formal equivalence between linearised self-attention and fast weight programming, introduces a delta rule-like update for the fast weight memory, and proposes a new kernel function for linearising attention.
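The equivalence named above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes causal, unnormalised linearised attention, and the feature map `phi` (ELU + 1) and the function names `linear_attention` and `fast_weight_programmer` are illustrative choices that do not come from this page.

```python
import numpy as np

def phi(x):
    # Illustrative positive feature map (ELU + 1). This is an assumption;
    # the kernel proposed in the paper is not reproduced here.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(K, V, Q):
    # Attention view: y_t = sum_{i <= t} v_i * (phi(k_i) . phi(q_t)),
    # with the attention normalisation omitted (assumption).
    out = np.zeros_like(V)
    for t in range(K.shape[0]):
        scores = phi(K[: t + 1]) @ phi(Q[t])  # one score per past step
        out[t] = scores @ V[: t + 1]
    return out

def fast_weight_programmer(K, V, Q):
    # Fast weight view: write an outer product into W, then query W.
    W = np.zeros((V.shape[1], K.shape[1]))
    out = np.zeros_like(V)
    for t in range(K.shape[0]):
        W = W + np.outer(V[t], phi(K[t]))  # additive programming instruction
        out[t] = W @ phi(Q[t])             # read-out with the current query
    return out

rng = np.random.default_rng(0)
K, V, Q = rng.normal(size=(3, 6, 4))  # 6 time steps, dimension 4

same = np.allclose(linear_attention(K, V, Q), fast_weight_programmer(K, V, Q))
print(same)
```

Both functions compute the same outputs because the sum over past steps can be folded into a single matrix `W` that is updated once per step, which is what makes the memory finite in size.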
Findings
Linear transformers can be viewed as fast weight programmers.
The proposed delta rule-like update lets the model more easily correct the current mapping from keys to values, instead of only adding to it.
Experiments demonstrate benefits on retrieval, translation, and language modeling tasks.
Abstract
We show the formal equivalence of linearised self-attention mechanisms and fast weight controllers from the early '90s, where a "slow" neural net learns by gradient descent to program the "fast weights" of another net through sequences of elementary programming instructions which are additive outer products of self-invented activation patterns (today called keys and values). Such Fast Weight Programmers (FWPs) learn to manipulate the contents of a finite memory and dynamically interact with it. We infer a memory capacity limitation of recent linearised softmax attention variants, and replace the purely additive outer products by a delta rule-like programming instruction, such that the FWP can more easily learn to correct the current mapping from keys to values. The FWP also learns to compute dynamically changing learning rates. We also propose a new kernel function to linearise…
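The difference between the purely additive instruction and the delta rule-like instruction shows up when a key is written twice. The sketch below is an illustration under stated assumptions, not the paper's code: the helper names `additive_write` and `delta_write` are hypothetical, the key is assumed to have unit norm, and the learning rate `beta` is fixed to 1.0 here, whereas the abstract describes it as computed dynamically by the model.

```python
import numpy as np

def additive_write(W, k, v):
    # Purely additive instruction: add the outer product of value and key.
    return W + np.outer(v, k)

def delta_write(W, k, v, beta):
    # Delta rule-like instruction: first read the value currently stored
    # under k, then move it towards v with learning rate beta.
    v_old = W @ k
    return W + beta * np.outer(v - v_old, k)

k = np.array([1.0, 0.0])    # unit-norm key (assumption), written twice
v1 = np.array([1.0, 2.0])   # first value stored under k
v2 = np.array([5.0, -1.0])  # intended replacement value

W0 = np.zeros((2, 2))
W_add = additive_write(additive_write(W0, k, v1), k, v2)
W_delta = delta_write(delta_write(W0, k, v1, 1.0), k, v2, 1.0)

print(W_add @ k)    # retrieves v1 + v2: the old value interferes
print(W_delta @ k)  # retrieves v2: the old value was overwritten
```

With `beta` between 0 and 1, the stored value becomes a blend of the old and new values, which is one way to read the abstract's "dynamically changing learning rates".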
Taxonomy
Topics: Neural Networks and Applications · Ferroelectric and Negative Capacitance Devices · Advanced Memory and Neural Computing
Methods: Softmax
