Fast-weight Product Key Memory
Tianyu Zhao, Llion Jones

TL;DR
The paper introduces FwPKM, a sparse fast-weight memory layer that enables efficient, rapid memorization and retrieval of large key-value associations in language models, improving long-context understanding.
Contribution
FwPKM is a sparse memory layer whose activated parameters receive local gradient updates at both training and inference time, so it can store many key-value associations while keeping per-token compute low and fixed.
Findings
Significantly reduces perplexity on long-context datasets.
Acts as an effective episodic memory that complements semantic memory.
Generalizes to 128K-token contexts after training on 4K-token sequences.
Abstract
Sequence modeling layers in modern language models typically face a trade-off between storage capacity and computational efficiency. While softmax attention offers unbounded storage at prohibitive quadratic cost, linear variants are more efficient but suffer from limited, fixed-size storage. We introduce Fast-weight Product Key Memory (FwPKM), a sparse fast-weight memory layer that resolves this tension. FwPKM updates sparsely activated parameters at both training and inference time using chunk-level gradient descent on a local memory-rewrite objective. This performs Test-Time Training (TTT)-style gradient updates on activated slots in a sparse memory, enabling rapid memorization and retrieval of many new key-value associations while keeping per-token compute low and fixed. Experiments show that FwPKM functions as an effective episodic memory that complements the semantic memory of…
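The mechanism the abstract describes (sparse lookup, then a gradient step on only the activated slots) can be illustrated with a small NumPy sketch. It is not the authors' implementation. The lookup follows the usual product-key construction, in which a query is split in two and each half is scored against its own sub-key table. The write step is an assumption made for illustration: plain SGD on a squared-error objective, applied to value rows only. The names `pkm_lookup`, `read`, `write_chunk`, and all sizes are hypothetical.

```python
import numpy as np

def pkm_lookup(query, subkeys_a, subkeys_b, top_k):
    """Pick top_k slots from an n_a * n_b product-key grid; return (slots, weights)."""
    half = query.shape[0] // 2
    scores_a = subkeys_a @ query[:half]      # one score per sub-key in table A
    scores_b = subkeys_b @ query[half:]      # one score per sub-key in table B
    idx_a = np.argsort(scores_a)[-top_k:]    # best top_k sub-keys in each table
    idx_b = np.argsort(scores_b)[-top_k:]
    # A slot's score is the sum of its two half-scores, so only the
    # top_k * top_k candidate pairs need to be compared.
    cand = scores_a[idx_a][:, None] + scores_b[idx_b][None, :]
    flat = np.argsort(cand.ravel())[-top_k:]
    ia, ib = np.unravel_index(flat, cand.shape)
    slots = idx_a[ia] * subkeys_b.shape[0] + idx_b[ib]
    s = cand[ia, ib]
    w = np.exp(s - s.max())                  # softmax over the selected slots
    return slots, w / w.sum()

def read(values, slots, weights):
    """Memory output: weighted sum of the selected value rows."""
    return weights @ values[slots]

def write_chunk(values, queries, targets, subkeys_a, subkeys_b, top_k, lr):
    """One SGD step on 0.5 * sum ||read - target||^2 over a chunk (assumed objective).

    All gradients are computed against the pre-update values, then applied
    in place to the activated rows only.
    """
    updates = []
    for q, t in zip(queries, targets):
        slots, w = pkm_lookup(q, subkeys_a, subkeys_b, top_k)
        err = read(values, slots, w) - t
        updates.append((slots, np.outer(w, err)))  # gradient w.r.t. selected rows
    for slots, grad in updates:
        np.add.at(values, slots, -lr * grad)

# Hypothetical sizes: 16 * 16 = 256 slots, 4 slots active per token.
rng = np.random.default_rng(0)
n, d_half, d_val, k = 16, 8, 32, 4
keys_a = rng.normal(size=(n, d_half))
keys_b = rng.normal(size=(n, d_half))
memory = np.zeros((n * n, d_val))
chunk_q = rng.normal(size=(5, 2 * d_half))
chunk_t = rng.normal(size=(5, d_val))
write_chunk(memory, chunk_q, chunk_t, keys_a, keys_b, k, lr=1.0)
```

Each token touches only `k` of the `n * n` value rows, so the cost of a read or write depends on `n` and `k` rather than on how many tokens have been stored, which is the "low and fixed" per-token compute the abstract refers to.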
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Parallel Computing and Optimization Techniques
