Linear Transformers Are Secretly Fast Weight Programmers

Imanol Schlag; Kazuki Irie; Jürgen Schmidhuber

arXiv:2102.11174 · cs.LG · June 10, 2021 · 24 cites

TL;DR

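This paper shows that linearised self-attention is formally equivalent to the fast weight programmers of the early '90s. It replaces the purely additive memory update with a delta rule-like update that uses dynamically computed learning rates, and reports benefits on retrieval, translation, and language modeling tasks.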

Contribution

It establishes the equivalence between linearised self-attention and fast weight programming, introduces a delta rule-like update, and proposes a new kernel for linearising attention.
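The equivalence claimed above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it omits the attention normalisation term for brevity, and the feature map `phi` (ELU + 1) is an assumed stand-in rather than the kernel the paper proposes. The function names are hypothetical.

```python
import numpy as np

def phi(x):
    # Assumed positive feature map (ELU + 1); NOT the paper's proposed kernel.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(K, V, Q):
    # Causal, unnormalised linearised attention:
    # y_i = sum_{j <= i} v_j * (phi(k_j) . phi(q_i))
    scores = np.tril(phi(Q) @ phi(K).T)  # scores[i, j] is kept only for j <= i
    return scores @ V

def fast_weight_programmer(K, V, Q):
    # Same computation as a recurrence over a fast weight matrix W:
    # each step adds the outer product of value and key, then queries W.
    T, d_v = V.shape
    d_k = K.shape[1]
    W = np.zeros((d_v, d_k))
    out = np.empty((T, d_v))
    for i in range(T):
        W += np.outer(V[i], phi(K[i]))  # additive "programming instruction"
        out[i] = W @ phi(Q[i])          # read from the fast weights
    return out

rng = np.random.default_rng(0)
K = rng.standard_normal((6, 4))
V = rng.standard_normal((6, 3))
Q = rng.standard_normal((6, 4))
same = np.allclose(linear_attention(K, V, Q), fast_weight_programmer(K, V, Q))
```

Under these assumptions both functions return the same outputs, which is the sense in which the attention form and the fast weight form describe one computation.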

Findings

01 Linear transformers can be viewed as fast weight programmers.

02 The proposed delta rule-like update lets the model correct an existing key-to-value mapping instead of only adding to it.

03 Experiments demonstrate benefits on retrieval, translation, and language modeling tasks.

Abstract

We show the formal equivalence of linearised self-attention mechanisms and fast weight controllers from the early '90s, where a "slow" neural net learns by gradient descent to program the "fast weights" of another net through sequences of elementary programming instructions which are additive outer products of self-invented activation patterns (today called keys and values). Such Fast Weight Programmers (FWPs) learn to manipulate the contents of a finite memory and dynamically interact with it. We infer a memory capacity limitation of recent linearised softmax attention variants, and replace the purely additive outer products by a delta rule-like programming instruction, such that the FWP can more easily learn to correct the current mapping from keys to values. The FWP also learns to compute dynamically changing learning rates. We also propose a new kernel function to linearise…
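To illustrate why a delta rule-like instruction makes it easier to correct a key-to-value mapping, here is a small sketch under stated assumptions: keys are already feature-mapped and unit-norm, and the learning rate `beta` is a fixed constant, whereas the abstract describes it as computed dynamically by the model. The helper names are hypothetical.

```python
import numpy as np

def additive_write(W, k, v):
    # Purely additive outer-product update: old associations are never removed.
    return W + np.outer(v, k)

def delta_write(W, k, v, beta=1.0):
    # Delta rule-like update: read the value currently stored under k,
    # then move it towards the new value v with step size beta.
    v_old = W @ k
    return W + beta * np.outer(v - v_old, k)

k = np.array([0.6, 0.8])        # assumed unit-norm key (0.36 + 0.64 = 1)
v1 = np.array([1.0, 0.0, 0.0])  # first value written under k
v2 = np.array([0.0, 1.0, 0.0])  # later value meant to replace v1

W0 = np.zeros((3, 2))
W_add = additive_write(additive_write(W0, k, v1), k, v2)
W_delta = delta_write(delta_write(W0, k, v1), k, v2)

read_add = W_add @ k      # mixes both writes: v1 + v2
read_delta = W_delta @ k  # returns only the latest write: v2
```

With the additive update, querying with `k` returns the sum of both stored values; with the delta update and `beta = 1`, the second write overwrites the first. Values of `beta` between 0 and 1 would interpolate between the old and new value.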

Code & Models

Models

guiferrarib/genesis-152m-instruct (model, ♡ 17)

Videos

Linear Transformers Are Secretly Fast Weight Memory Systems (Machine Learning Paper Explained) (YouTube)

Linear Transformers Are Secretly Fast Weight Programmers (SlidesLive)

Taxonomy

Topics: Neural Networks and Applications · Ferroelectric and Negative Capacitance Devices · Advanced Memory and Neural Computing

Methods: Softmax