Fast-weight Product Key Memory

Tianyu Zhao; Llion Jones

arXiv:2601.00671·cs.CL·February 24, 2026

Open Access

TL;DR

The paper introduces FwPKM, a sparse fast-weight memory layer that enables efficient, rapid memorization and retrieval of large key-value associations in language models, improving long-context understanding.

Contribution

FwPKM is a sparse memory layer that applies local gradient updates to its activated slots at both training and inference time, combining large storage capacity with low, fixed per-token compute.
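
As background for the "sparse memory layer" above, the following is a minimal sketch of a standard product-key lookup, the retrieval scheme the layer's name refers to. It is not taken from the paper: the function names (`product_key_lookup`, `top_k`), the dot-product scoring, and the softmax over the selected scores are illustrative assumptions, and FwPKM's actual scoring and normalization may differ.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(scores, k):
    """Return (index, score) pairs for the k largest scores."""
    return sorted(enumerate(scores), key=lambda p: p[1], reverse=True)[:k]

def product_key_lookup(query, sub_keys_1, sub_keys_2, k):
    """Select k memory slots for a query (hypothetical helper, not the paper's API).

    The query is split into two halves; each half is scored against its own
    small set of sub-keys. A slot (i, j) scores s1[i] + s2[j], so only
    k * k candidates are examined instead of len(sub_keys_1) * len(sub_keys_2).
    """
    half = len(query) // 2
    q1, q2 = query[:half], query[half:]
    best_1 = top_k([dot(q1, key) for key in sub_keys_1], k)
    best_2 = top_k([dot(q2, key) for key in sub_keys_2], k)

    n2 = len(sub_keys_2)
    candidates = [(i * n2 + j, s1 + s2) for i, s1 in best_1 for j, s2 in best_2]
    candidates.sort(key=lambda p: p[1], reverse=True)
    selected = candidates[:k]

    # Assumed normalization: softmax over the k selected scores.
    peak = max(score for _, score in selected)
    exps = [math.exp(score - peak) for _, score in selected]
    total = sum(exps)
    return [(slot, e / total) for (slot, _), e in zip(selected, exps)]

sub_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
activation = product_key_lookup([1.0, 0.0, 0.2, 1.0], sub_keys, sub_keys, k=2)
```

With 3 sub-keys per half this addresses 9 slots while scoring only 6 sub-keys; the same factorization is what lets the number of slots grow quadratically in the number of sub-keys while the lookup cost grows only linearly.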

Findings

01. Significant perplexity reductions on long-context datasets.

02. Effective episodic memory complementing semantic memory.

03. Generalizes to 128K-token contexts from 4K-token training sequences.

Abstract

Sequence modeling layers in modern language models typically face a trade-off between storage capacity and computational efficiency. While softmax attention offers unbounded storage at prohibitive quadratic cost, linear variants are more efficient but suffer from limited, fixed-size storage. We introduce Fast-weight Product Key Memory (FwPKM), a sparse fast-weight memory layer that resolves this tension. FwPKM updates sparsely activated parameters at both training and inference time using chunk-level gradient descent on a local memory-rewrite objective. This performs Test-Time Training (TTT)-style gradient updates on activated slots in a sparse memory, enabling rapid memorization and retrieval of many new key-value associations while keeping per-token compute low and fixed. Experiments show that FwPKM functions as an effective episodic memory that complements the semantic memory of…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Parallel Computing and Optimization Techniques