TL;DR
This paper introduces Key-Value Means (KVM), a block-recurrent attention mechanism for transformers that supports expandable memory and efficient long-context processing, can be implemented without custom kernels, and combines benefits of transformers and RNNs.
Contribution
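The authors propose KVM, a block-recurrent attention method that supports either fixed-size or growing context memory, allows chunk-wise parallelizable training and prefill, and is implementable with standard operations. They show that a transformer trained with it performs competitively on long-context tests.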
Findings
A transformer trained with a growable KVM cache performs competitively on long-context tests, with subquadratic prefill time and sublinear state growth.
KVM supports chunk-wise parallelizable training and prefill operations.
KVM can be used on every layer, reducing KV-cache memory and improving long-context decoding.
Abstract
We present Key-Value Means ("KVM"), a novel block-recurrence for attention that can accommodate either fixed-size or growing state. Equipping a strong transformer baseline with fixed-size KVM attention layers yields a strong chunked RNN, while adding only an insignificant number of new parameters. We train a transformer with a growable KVM cache and show it performs competitively on long-context tests with only subquadratic prefill time and sublinear state growth. KVM is implementable with standard operations and without custom kernels, and supports chunk-wise parallelizable training and prefill. It provides many of the benefits of both traditional transformers (expandable context memory, chunk-wise parallelizable training and prefill) and linear RNNs in a single unified package. It can be used on every layer, saving KV-cache memory, and allowing a continuous range of choices of…
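To make the idea of a block recurrence with a fixed-size state more concrete, here is a minimal NumPy sketch. It is not the authors' method: the abstract does not give KVM's update rule, so the summarization step (one mean key and one mean value per finished chunk), the oldest-first eviction, and the names chunked_mean_memory_attention, chunk_size, and num_slots are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def chunked_mean_memory_attention(q, k, v, chunk_size, num_slots):
    """Hypothetical chunk-wise attention with a bounded memory of key/value means.

    ASSUMPTION (not from the paper): each finished chunk is summarized by the
    mean of its keys and the mean of its values, and only the most recent
    `num_slots` summaries are kept. Requires num_slots >= 1.
    """
    T, d = q.shape
    mem_k = np.zeros((0, d))           # memory keys, at most num_slots rows
    mem_v = np.zeros((0, v.shape[1]))  # memory values, at most num_slots rows
    outputs, state_sizes = [], []

    for start in range(0, T, chunk_size):
        qc = q[start:start + chunk_size]
        kc = k[start:start + chunk_size]
        vc = v[start:start + chunk_size]
        c, m = qc.shape[0], mem_k.shape[0]

        # Queries in this chunk see the memory slots plus the chunk itself.
        keys = np.concatenate([mem_k, kc], axis=0)
        vals = np.concatenate([mem_v, vc], axis=0)
        scores = qc @ keys.T / np.sqrt(d)

        # Memory is fully visible; tokens inside the chunk are causally masked.
        mask = np.concatenate(
            [np.ones((c, m), dtype=bool), np.tril(np.ones((c, c), dtype=bool))],
            axis=1,
        )
        scores = np.where(mask, scores, -np.inf)
        outputs.append(softmax(scores) @ vals)

        # Recurrent state update: append this chunk's means, drop the oldest.
        mem_k = np.concatenate([mem_k, kc.mean(axis=0, keepdims=True)])[-num_slots:]
        mem_v = np.concatenate([mem_v, vc.mean(axis=0, keepdims=True)])[-num_slots:]
        state_sizes.append(mem_k.shape[0])

    return np.concatenate(outputs, axis=0), state_sizes
```

In this sketch all tokens within a chunk are processed in parallel, and the state carried between chunks never exceeds num_slots rows, however long the sequence is. Letting num_slots increase with sequence length would give a growing state instead, which is one way to read the fixed-size versus growable distinction in the abstract, though the paper's actual growth schedule is not stated here.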
