Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

Piotr Nawrot; Adrian Łańcucki; Marcin Chochowski; David Tarjan; Edoardo M. Ponti

arXiv:2403.09636·cs.CL·July 24, 2024·1 cites

TL;DR

This paper introduces Dynamic Memory Compression (DMC), a method that compresses the key-value cache of large language models online at inference time. It reports up to a 7x throughput increase, and it preserves the original downstream performance at up to 4x cache compression.

Contribution

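DMC is an online key-value cache compression technique that retrofits pre-trained LLMs through continued pre-training on a negligible percentage of the original data, without adding extra parameters. It achieves up to 7x higher throughput during auto-regressive inference on an NVIDIA H100 GPU.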

Findings

01

Achieves up to 7x throughput increase during inference.

02

Maintains original model performance with up to 4x cache compression.

03

Outperforms existing approaches to reducing cache size, such as grouped-query attention (GQA) and H2O.

Abstract

Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time. Most importantly, the model learns to apply different compression ratios in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to 7x throughput increase during auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. DMC preserves the original downstream performance with up to 4x cache compression,…
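To make the linear scaling described in the abstract concrete, here is a minimal back-of-the-envelope sketch of key-value cache size. It is not taken from the paper: the function name `kv_cache_bytes` and its parameters are hypothetical, the shape values (32 layers, 32 key-value heads, head dimension 128, 16-bit elements) are the commonly cited Llama 2 7B configuration, and a single uniform compression ratio is assumed purely for illustration, whereas DMC learns different ratios per head and layer.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2, compression_ratio=1.0):
    """Estimate key-value cache size in bytes.

    Hypothetical helper for illustration only. Assumes one uniform
    compression ratio across all heads and layers, which is a
    simplification of the per-head, per-layer ratios DMC learns.
    """
    if compression_ratio < 1.0:
        raise ValueError("compression_ratio must be >= 1.0")
    # Factor 2: one key vector and one value vector per token, head, and layer.
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return elems * bytes_per_elem / compression_ratio


if __name__ == "__main__":
    # Assumed Llama 2 7B-like shape: 32 layers, 32 KV heads, head dim 128, fp16.
    for cr in (1.0, 2.0, 4.0):
        size = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1,
                              compression_ratio=cr)
        print(f"compression {cr:.0f}x: {size / 2**30:.2f} GiB")
```

Under these assumptions, a single 4096-token sequence needs 2 GiB of cache uncompressed and 0.5 GiB at 4x compression. Because the size is linear in both `seq_len` and `batch_size`, the memory freed by compression can instead hold a larger batch or longer sequences within the same budget.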

Code & Models

Repositories

NVIDIA/Megatron-LM
pytorch

Taxonomy

Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Algorithms and Data Compression

Methods: Attention Is All You Need · Dense Connections · Feedforward Network · Softmax · Grouped-query attention