HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading
Cheng Luo, Zefan Cai, Hanshi Sun, Jinqi Xiao, Bo Yuan, Wen Xiao, Junjie Hu, Jiawei Zhao, Beidi Chen, Anima Anandkumar

TL;DR
HEADINFER introduces a head-wise offloading technique that significantly reduces GPU memory usage during large language model inference by offloading key-value caches to CPU RAM, enabling longer context processing on consumer hardware.
Contribution
The paper presents a novel head-wise offloading strategy for KV caches in LLMs, achieving substantial memory reduction without sacrificing computational efficiency.
Findings
Reduced the GPU memory footprint of the KV cache from 128 GB to 1 GB (Llama-3-8B, 1-million-token sequence)
Lowered total GPU memory usage from 207 GB to 17 GB in the same evaluation
Enabled 4-million-token inference on a 24 GB GPU
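The 128 GB figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes the publicly documented Llama-3-8B configuration (32 layers, 8 key/value heads, head dimension 128), 16-bit cache entries, and "1 million" tokens read as 2^20; none of these values are stated in this summary, so treat them as assumptions.

```python
# Assumed Llama-3-8B configuration (from the public model config, not from this summary).
NUM_LAYERS = 32
NUM_KV_HEADS = 8        # grouped-query attention: 8 key/value heads per layer
HEAD_DIM = 128
BYTES_PER_VALUE = 2     # assumption: fp16 / bf16 cache entries
SEQ_LEN = 2 ** 20       # assumption: "1 million" tokens taken as 1,048,576

GIB = 2 ** 30


def kv_cache_bytes(seq_len, num_layers, num_kv_heads):
    """Bytes needed to hold keys and values for the given layers and heads."""
    # Factor 2 accounts for storing both a key and a value vector per token.
    return 2 * seq_len * num_layers * num_kv_heads * HEAD_DIM * BYTES_PER_VALUE


full_gib = kv_cache_bytes(SEQ_LEN, NUM_LAYERS, NUM_KV_HEADS) / GIB
one_layer_gib = kv_cache_bytes(SEQ_LEN, 1, NUM_KV_HEADS) / GIB
one_head_gib = kv_cache_bytes(SEQ_LEN, 1, 1) / GIB

print(f"full KV cache:       {full_gib:g} GiB")
print(f"one layer:           {one_layer_gib:g} GiB")
print(f"one head, one layer: {one_head_gib:g} GiB")
```

Under these assumptions the full cache comes to 128 GiB, while a single head of a single layer needs 0.5 GiB. Holding two such head buffers on the GPU at once (one in use, one being prefetched) would give roughly 1 GiB, which is one plausible reading of the reported figure rather than something this summary confirms.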
Abstract
Transformer-based large language models (LLMs) demonstrate impressive performance in long context generation. Extending the context length has disproportionately shifted the memory footprint of LLMs during inference to the key-value cache (KV cache). In this paper, we propose HEADINFER, which offloads the KV cache to CPU RAM while avoiding the need to fully store the KV cache for any transformer layer on the GPU. HEADINFER employs a fine-grained, head-wise offloading strategy, keeping only the KV cache of selected attention heads on the GPU while computing the attention output dynamically. Through roofline analysis, we demonstrate that HEADINFER maintains computational efficiency while significantly reducing the memory footprint. We evaluate HEADINFER on the Llama-3-8B model with a 1-million-token sequence, reducing the GPU memory footprint of the KV cache from 128 GB to 1 GB and the total GPU…
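The head-wise idea in the abstract relies on attention heads being independent: each head's output depends only on that head's keys and values, so heads can be processed one at a time. The following is a minimal sketch of that property, not the authors' implementation. Plain Python dictionaries stand in for CPU RAM and GPU memory, and the names `HeadwiseKVCache`, `fetch`, and `headwise_attention` are hypothetical. A real system would also overlap transfers with computation, which this sketch does not model.

```python
import math


def attend(q, keys, values):
    """Single-query softmax attention for one head, using lists of floats."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


class HeadwiseKVCache:
    """Hypothetical cache: all heads live in `cpu`; `gpu` holds one head at a time."""

    def __init__(self, num_heads):
        self.cpu = {h: ([], []) for h in range(num_heads)}
        self.gpu = {}
        self.peak_heads_on_gpu = 0

    def append(self, head, key, value):
        self.cpu[head][0].append(key)
        self.cpu[head][1].append(value)

    def fetch(self, head):
        self.gpu.clear()                 # evict the previously resident head
        self.gpu[head] = self.cpu[head]  # stands in for a host-to-device copy
        self.peak_heads_on_gpu = max(self.peak_heads_on_gpu, len(self.gpu))
        return self.gpu[head]


def headwise_attention(queries, cache):
    """Compute attention head by head and concatenate the per-head outputs."""
    out = []
    for head, q in enumerate(queries):
        keys, values = cache.fetch(head)
        out.extend(attend(q, keys, values))
    return out


# Toy data: 2 heads, 3 cached tokens, head dimension 2.
cache = HeadwiseKVCache(num_heads=2)
for t in range(3):
    for h in range(2):
        cache.append(h, [float(t), 1.0], [float(h), float(t)])

queries = [[0.0, 0.0], [1.0, 0.0]]
output = headwise_attention(queries, cache)

# Reference: the same attention computed with every head's K/V available at once.
reference = []
for h, q in enumerate(queries):
    reference.extend(attend(q, *cache.cpu[h]))

print(output == reference, cache.peak_heads_on_gpu)
```

The per-head loop produces the same result as the all-heads-resident reference while the stand-in GPU store never holds more than one head, which is the property that makes head-granular offloading possible.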
Taxonomy
Topics: Brain Tumor Detection and Classification · Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques
Methods: Softmax · Attention Is All You Need
