HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading

Cheng Luo; Zefan Cai; Hanshi Sun; Jinqi Xiao; Bo Yuan; Wen Xiao; Junjie Hu; Jiawei Zhao; Beidi Chen; Anima Anandkumar

arXiv:2502.12574·cs.LG·February 19, 2025

TL;DR

HEADINFER introduces a head-wise offloading technique that significantly reduces GPU memory usage during large language model inference by offloading key-value caches to CPU RAM, enabling longer context processing on consumer hardware.

Contribution

The paper presents a novel head-wise offloading strategy for KV caches in LLMs, achieving substantial memory reduction without sacrificing computational efficiency.

Findings

1. Reduced KV cache memory from 128 GB to 1 GB
2. Lowered total GPU memory from 207 GB to 17 GB
3. Enabled 4-million-token inference on a 24 GB GPU
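The 128 GB figure can be sanity-checked with back-of-envelope arithmetic for the setting the abstract names (Llama-3-8B, 1-million-token sequence). The sketch below is illustrative only: the model shape (32 layers, 8 KV heads, head dimension 128), the 2-byte element size, and the reading of "1 million" as 2^20 tokens are assumptions taken from the public Llama-3-8B configuration, not values stated on this page, and `kv_cache_bytes` is a hypothetical helper.

```python
# Back-of-envelope KV cache size.
# ASSUMPTIONS (not stated on this page): shape values from the public
# Llama-3-8B config, 2-byte (fp16/bf16) elements, "1 million" = 2**20 tokens.
NUM_LAYERS = 32
NUM_KV_HEADS = 8       # grouped-query attention: fewer KV heads than query heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2
SEQ_LEN = 2**20

GIB = 2**30


def kv_cache_bytes(seq_len, layers=NUM_LAYERS, kv_heads=NUM_KV_HEADS,
                   head_dim=HEAD_DIM, bytes_per_value=BYTES_PER_VALUE):
    """Bytes needed to store keys and values (hence the factor 2)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len


# Whole cache resident on the GPU.
full_gib = kv_cache_bytes(SEQ_LEN) / GIB

# Only one KV head of one layer resident on the GPU.
one_head_gib = kv_cache_bytes(SEQ_LEN, layers=1, kv_heads=1) / GIB

print(f"full KV cache:       {full_gib} GiB")
print(f"one head, one layer: {one_head_gib} GiB")
```

Under these assumptions the full cache comes to 128 GiB, in line with the first finding, and a single head of a single layer comes to 0.5 GiB. How the reported 1 GB is accounted for (for example, whether a second buffer is held for prefetching) is not stated on this page.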

Abstract

Transformer-based large language models (LLMs) demonstrate impressive performance in long context generation. Extending the context length has disproportionately shifted the memory footprint of LLMs during inference to the key-value cache (KV cache). In this paper, we propose HEADINFER, which offloads the KV cache to CPU RAM while avoiding the need to fully store the KV cache for any transformer layer on the GPU. HEADINFER employs a fine-grained, head-wise offloading strategy, keeping only selected attention heads' KV cache on the GPU while computing attention output dynamically. Through roofline analysis, we demonstrate that HEADINFER maintains computational efficiency while significantly reducing memory footprint. We evaluate HEADINFER on the Llama-3-8B model with a 1-million-token sequence, reducing the GPU memory footprint of the KV cache from 128 GB to 1 GB and the total GPU…
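Head-wise offloading is possible because attention heads are independent: each head's output depends only on that head's queries, keys, and values. The NumPy sketch below illustrates that property for a single decode step. It is not the authors' implementation; the function names, the tensor sizes, and the use of an array copy to stand in for a CPU-to-GPU transfer are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_all_heads(q, k, v):
    """Reference: every head's K and V is resident at once.

    q: (heads, q_len, d); k, v: (heads, kv_len, d).
    """
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def attention_head_by_head(q, host_k, host_v):
    """Hypothetical head-wise loop: one head's K and V resident at a time.

    host_k / host_v stand in for a KV cache kept in CPU RAM; .copy()
    stands in for the host-to-device transfer of a single head.
    """
    outputs = []
    for h in range(q.shape[0]):
        k_h = host_k[h].copy()  # "fetch" this head's keys
        v_h = host_v[h].copy()  # "fetch" this head's values
        scores = q[h] @ k_h.T / np.sqrt(q.shape[-1])
        outputs.append(softmax(scores) @ v_h)
        # k_h and v_h can be released here before the next head is fetched
    return np.stack(outputs)


# Toy sizes (assumed): 4 heads, 1 query token (decode step), 16 cached tokens.
rng = np.random.default_rng(0)
heads, q_len, kv_len, d = 4, 1, 16, 8
q = rng.standard_normal((heads, q_len, d))
k = rng.standard_normal((heads, kv_len, d))
v = rng.standard_normal((heads, kv_len, d))

out_ref = attention_all_heads(q, k, v)
out_headwise = attention_head_by_head(q, k, v)
same = np.allclose(out_ref, out_headwise)
print("outputs match:", same)
```

The sketch omits causal masking, which a single-token decode step does not need, and it does not model overlapping transfers with computation, which a practical system would need in order to hide transfer latency.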

Code & Models

Repositories

wdlctc/headinfer (PyTorch, official)

Taxonomy

Topics: Brain Tumor Detection and Classification · Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques

Methods: Softmax · Attention Is All You Need