Trellis: Learning to Compress Key-Value Memory in Attention Models
Mahdi Karami, Ali Behrouz, Praneeth Kacham, Vahab Mirrokni

TL;DR
Trellis is a new Transformer architecture that dynamically compresses its key-value memory at test time, reducing memory usage and improving performance on long sequences across various tasks.
Contribution
It introduces a learnable, recursive compression mechanism with bounded memory for Transformers, enabling efficient handling of long sequences and dynamic memory updates.
Findings
Outperforms strong baselines on language modeling and reasoning tasks.
Performance improves with longer sequences, showing scalability.
Efficiently updates memory at test time using online gradient descent.
Abstract
Transformers, while powerful, suffer from quadratic computational complexity and the ever-growing Key-Value (KV) cache of the attention mechanism. This paper introduces Trellis, a novel Transformer architecture with bounded memory that learns how to compress its key-value memory dynamically at test time. Trellis replaces the standard KV cache with a fixed-size memory and trains a two-pass recurrent compression mechanism to store new keys and values into memory. To achieve this, it leverages an online gradient descent procedure with a forget gate, enabling the compressed memory to be updated recursively while learning to retain important contextual information from incoming tokens at test time. Extensive experiments on language modeling, common-sense reasoning, recall-intensive tasks, and time series show that the proposed architecture outperforms strong baselines. Notably, its…
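The general idea of updating a fixed-size memory by online gradient descent with a forget gate can be illustrated with a small sketch. This is not the paper's two-pass compression mechanism, whose details are not given in the abstract. It is a generic decayed delta-rule update on a linear associative memory, and the names `ogd_memory_step`, `read`, `lr`, and `forget` are assumptions introduced here for illustration.

```python
import numpy as np

def ogd_memory_step(M, k, v, lr=0.5, forget=0.95):
    """One online gradient step on 0.5 * ||M @ k - v||^2, with decay.

    Hypothetical helper: M is a (d_v, d_k) memory, k a key, v a value.
    `forget` scales the old memory (a scalar forget gate, assumed fixed here).
    """
    err = M @ k - v              # prediction error for the incoming pair
    grad = np.outer(err, k)      # gradient of the loss with respect to M
    return forget * M - lr * grad

def read(M, q):
    """Retrieve a value estimate for query q from the memory."""
    return M @ q

rng = np.random.default_rng(0)
d_k, d_v, T = 8, 4, 1000         # illustrative sizes, not from the paper

M = np.zeros((d_v, d_k))
for _ in range(T):
    k = rng.standard_normal(d_k)
    k /= np.linalg.norm(k)       # unit-norm keys keep the step size stable
    v = rng.standard_normal(d_v)
    M = ogd_memory_step(M, k, v)

# The memory footprint is independent of the number of tokens processed.
memory_shape = M.shape
```

Under these assumptions the memory stays at `d_v * d_k` entries however many tokens are seen, whereas a standard KV cache grows linearly with sequence length. In a learned model the forget gate and step size would typically depend on the input rather than being constants.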
Taxonomy
Topics: Big Data and Digital Economy · Machine Learning in Healthcare · Topic Modeling
