Blending Complementary Memory Systems in Hybrid Quadratic-Linear Transformers
Kazuki Irie, Morris Yau, Samuel J. Gershman

TL;DR
This paper introduces hybrid memory architectures combining key-value and fast weight memories in transformers, enhancing sequence processing by leveraging their complementary strengths for language modeling, algorithmic, and reinforcement learning tasks.
Contribution
The paper proposes and compares three methods to effectively blend quadratic and linear transformer memory systems, demonstrating improved performance across multiple tasks.
Findings
Hybrid memory systems outperform individual components in language modeling.
Hybrid approaches enable processing of longer sequences with better recall.
Experimental results show improved performance in reinforcement learning environments.
Abstract
We develop hybrid memory architectures for general-purpose sequence processing neural networks that combine key-value memory using softmax attention (KV-memory) with fast weight memory through dynamic synaptic modulation (FW-memory), the core principles of quadratic and linear transformers, respectively. These two memory systems have complementary but individually limited properties: KV-memory offers precise retrieval but is constrained by quadratic complexity in sequence length, while FW-memory supports arbitrarily long sequences and enables more expressive computation but sacrifices precise recall. We propose and compare three methods to blend these two systems into a single memory system, differing in how and when input information is delivered to each system, to leverage the strengths of both. We conduct experiments on general language modeling and retrieval tasks by training…
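The contrast between the two memory systems described above can be illustrated with a minimal sketch. This is not the paper's implementation: the class names `KVMemory` and `FWMemory` are hypothetical, and the fast weight update shown is the simplest purely additive outer-product rule, assumed here for illustration only (the paper's FW-memory may use a different update rule, and none of its three blending methods is shown).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class KVMemory:
    """Hypothetical sketch: softmax attention over stored (key, value) pairs.
    Storage grows with every write, so reads cost O(sequence length)."""
    def __init__(self):
        self.keys, self.values = [], []

    def write(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def read(self, q):
        weights = softmax([dot(q, k) for k in self.keys])
        dim = len(self.values[0])
        return [sum(w * v[i] for w, v in zip(weights, self.values))
                for i in range(dim)]

    def size(self):
        return sum(len(k) + len(v) for k, v in zip(self.keys, self.values))

class FWMemory:
    """Hypothetical sketch: a fixed-size fast weight matrix W.
    Assumed update rule (illustrative only): W += outer(v, k)."""
    def __init__(self, d_key, d_value):
        self.W = [[0.0] * d_key for _ in range(d_value)]

    def write(self, k, v):
        for i, vi in enumerate(v):
            for j, kj in enumerate(k):
                self.W[i][j] += vi * kj

    def read(self, q):
        return [dot(row, q) for row in self.W]

    def size(self):
        return len(self.W) * len(self.W[0])

# Toy data: four orthonormal (one-hot) keys with 2-dimensional values.
keys = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
values = [[float(i), float(-i)] for i in range(4)]

kv, fw = KVMemory(), FWMemory(d_key=4, d_value=2)
for k, v in zip(keys, values):
    kv.write(k, v)
    fw.write(k, v)

print("KV storage:", kv.size(), "| FW storage:", fw.size())
print("FW read with key 2:", fw.read(keys[2]))
```

In this toy setting the KV storage grows with each written pair while the fast weight matrix keeps a constant size. The fast weight read is exact here only because the keys are orthonormal; with more stored pairs than key dimensions, the additive rule mixes values together, which is one way to picture the loss of precise recall mentioned in the abstract.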
Taxonomy
Topics: Advanced Memory and Neural Computing · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies
