Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti

TL;DR
This paper introduces Dynamic Memory Compression (DMC), a method that compresses the key-value cache of large language models online during inference, raising throughput by up to 7x while preserving downstream performance at up to 4x cache compression.
Contribution
DMC is an online key-value cache compression technique for retrofitting pre-trained LLMs. It yields up to 7x higher inference throughput, requires continued pre-training on only a negligible percentage of the original data, and adds no extra parameters.
Findings
Achieves up to 7x throughput increase during auto-regressive inference on an NVIDIA H100 GPU.
Maintains original model performance with up to 4x cache compression.
Outperforms existing approaches to reducing cache size, such as grouped-query attention (GQA) and H2O.
Abstract
Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time. Most importantly, the model learns to apply different compression ratios in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to 7x throughput increase during auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. DMC preserves the original downstream performance with up to 4x cache compression,…
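To make the linear scaling described in the abstract concrete, the following is a minimal sketch of how key-value cache size could be estimated. The function name `kv_cache_bytes` and its parameters are hypothetical and do not come from the paper. The example figures assume the Llama 2 7B shape (32 layers, 32 key-value heads, head dimension 128) with 16-bit values. The single uniform `compression_ratio` is a simplification: DMC learns different ratios in different heads and layers, so this only approximates an average effect.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2, compression_ratio=1.0):
    """Estimate KV cache size in bytes (illustrative helper, not from the paper).

    compression_ratio is assumed uniform across heads and layers, which is a
    simplification of DMC's learned per-head, per-layer ratios.
    """
    # Two cached tensors (keys and values) per layer, one vector per
    # head, per token, per sequence in the batch.
    n_elems = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return n_elems * bytes_per_elem / compression_ratio


GIB = 1024 ** 3

# Assumed Llama 2 7B shape, 4096-token context, batch size 1, 16-bit cache.
baseline = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
compressed = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1,
                            compression_ratio=4.0)

print(f"baseline:   {baseline / GIB:.2f} GiB")    # 2.00 GiB
print(f"4x smaller: {compressed / GIB:.2f} GiB")  # 0.50 GiB
```

Under these assumptions the cache grows by 2 GiB for every additional 4096-token sequence in the batch, which is why a smaller cache leaves room for larger batches or longer contexts on the same GPU.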
Taxonomy
Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Algorithms and Data Compression
Methods: Attention Is All You Need · Dense Connections · Feedforward Network · Softmax · Grouped-query attention
