Online Vector Quantized Attention
Nick Alonso, Tomas Figliolia, Beren Millidge

TL;DR
The paper introduces OVQ-attention, a sequence mixing layer with linear compute and constant memory costs. Its sparse memory updates allow a much larger memory state, and it outperforms linear attention baselines on long-sequence tasks.
Contribution
It develops an online vector-quantized attention mechanism whose sparse memory updates enable long-context processing at linear compute and constant memory cost.
Findings
Significant improvements over linear attention baselines on long-sequence tasks.
Competitive performance on sequences up to 64k length.
Uses a small fraction of memory compared to full self-attention.
Abstract
Standard sequence mixing layers used in language models struggle to balance efficiency and performance. Self-attention performs well on long context tasks but has expensive quadratic compute and linear memory costs, while linear attention and SSMs use only linear compute and constant memory but struggle with long context processing. In this paper, we develop a sequence mixing layer that aims to find a better compromise between memory-compute costs and long-context processing, which we call online vector-quantized (OVQ) attention. OVQ-attention requires linear compute costs and constant memory, but, unlike linear attention and SSMs, it uses a sparse memory update that allows it to greatly increase the size of its memory state and, consequently, memory capacity. We develop a theoretical basis for OVQ-attention based on Gaussian mixture regression, and we test it on a variety of synthetic…
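The abstract's central idea, a fixed-size memory that is updated sparsely and read in the style of Gaussian mixture regression, can be illustrated with a toy sketch. This is not the authors' implementation: the class `OnlineVQMemory`, its methods, the running-mean update rule, the count-based mixing weights, and the `beta` sharpness parameter are all hypothetical choices made only to show the general shape of such a layer.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class OnlineVQMemory:
    """Toy fixed-size memory (an illustrative assumption, not the paper's layer).

    The state is num_slots key centroids, num_slots value means, and
    num_slots counts, so its size does not grow with sequence length.
    """

    def __init__(self, num_slots, dim_k, dim_v):
        self.dim_v = dim_v
        self.keys = [[0.0] * dim_k for _ in range(num_slots)]
        self.values = [[0.0] * dim_v for _ in range(num_slots)]
        self.counts = [0] * num_slots

    def write(self, k, v):
        """Sparse update: exactly one slot changes per incoming (k, v) pair."""
        if 0 in self.counts:
            # Fill empty slots before reusing occupied ones.
            j = self.counts.index(0)
        else:
            # Otherwise assign the pair to its nearest key centroid.
            j = min(range(len(self.keys)), key=lambda i: sq_dist(k, self.keys[i]))
        self.counts[j] += 1
        lr = 1.0 / self.counts[j]  # running-mean step size
        self.keys[j] = [c + lr * (x - c) for c, x in zip(self.keys[j], k)]
        self.values[j] = [c + lr * (x - c) for c, x in zip(self.values[j], v)]
        return j

    def read(self, q, beta=1.0):
        """Mixture-regression-style read over the occupied slots.

        Each slot is treated as an isotropic Gaussian over keys with a
        mixing weight proportional to its count; the output is the
        posterior-weighted average of the slot value means.
        """
        used = [i for i, c in enumerate(self.counts) if c > 0]
        if not used:
            return [0.0] * self.dim_v
        logits = [
            -0.5 * beta * sq_dist(q, self.keys[i]) + math.log(self.counts[i])
            for i in used
        ]
        m = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        out = [0.0] * self.dim_v
        for w, i in zip(weights, used):
            for d in range(self.dim_v):
                out[d] += (w / z) * self.values[i][d]
        return out
```

In this sketch each `write` costs time proportional to the number of slots and touches a single slot, so processing a sequence is linear in its length while the state stays constant in size. Increasing `num_slots` enlarges the memory without changing how many entries each update modifies.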
Taxonomy
Topics: Speech Recognition and Synthesis · Domain Adaptation and Few-Shot Learning · Topic Modeling
