FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference

Bingzhe Zhao; Ke Cheng; Aomufei Yuan; Yuxuan Tian; Ruiguang Zhong; Chengchen Hu; Tong Yang; Lian Yu

arXiv:2502.15804 · cs.DC · May 20, 2025

TL;DR

FairKV addresses load imbalance in multi-GPU Transformer inference caused by imbalanced KV cache compression, using a novel Fair-Copying technique to improve throughput and resource utilization.

Contribution

We introduce FairKV, a novel method that ensures fair memory usage among attention heads in multi-GPU systems with imbalanced KV cache compression.

Findings

01 FairKV increases throughput by 1.66x on LLaMA 70b and Mistral 24b models.

02 FairKV mitigates load imbalance across GPUs during inference.

03 The method is effective in large-scale Transformer models.

Abstract

KV cache techniques in Transformer models aim to reduce redundant computations at the expense of substantially increased memory usage, making KV cache compression an important and popular research topic. Recently, state-of-the-art KV cache compression methods implement imbalanced, per-head allocation algorithms that dynamically adjust the KV cache budget for each attention head, achieving excellent performance in single-GPU scenarios. However, we observe that such imbalanced compression leads to significant load imbalance when deploying multi-GPU inference, as some GPUs become overburdened while others remain underutilized. In this paper, we propose FairKV, a method designed to ensure fair memory usage among attention heads in systems employing imbalanced KV cache compression. The core technique of FairKV is Fair-Copying, which replicates a small subset of memory-intensive attention…
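The load imbalance described in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: the per-head budgets are made-up numbers, the helper names `gpu_loads` and `imbalance` are hypothetical, and the contiguous assignment of heads to GPUs is an assumption about how heads are typically partitioned.

```python
from typing import List


def gpu_loads(head_budgets: List[int], num_gpus: int) -> List[int]:
    """Sum the per-head KV cache budgets held by each GPU.

    Assumption: heads are split into equal, contiguous groups, one group per GPU.
    """
    if len(head_budgets) % num_gpus != 0:
        raise ValueError("number of heads must be divisible by num_gpus")
    per_gpu = len(head_budgets) // num_gpus
    return [
        sum(head_budgets[g * per_gpu:(g + 1) * per_gpu])
        for g in range(num_gpus)
    ]


def imbalance(loads: List[int]) -> float:
    """Ratio of the busiest GPU's load to the mean load (1.0 means balanced)."""
    return max(loads) / (sum(loads) / len(loads))


# Illustrative budgets (cached tokens per head) after an imbalanced,
# per-head compression step; these values are invented for the example.
skewed = [900, 700, 100, 100, 50, 50, 50, 50]
uniform = [250] * 8

print(gpu_loads(skewed, 4), imbalance(gpu_loads(skewed, 4)))
print(gpu_loads(uniform, 4), imbalance(gpu_loads(uniform, 4)))
```

With the same total budget, the skewed allocation puts most of the cache on one GPU while the uniform allocation spreads it evenly. This gap is the kind of imbalance the abstract says Fair-Copying is meant to reduce.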

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Data Storage Technologies

Methods: Attention Is All You Need · Absolute Position Encodings · Linear Layer · Layer Normalization · Byte Pair Encoding · Dense Connections · Residual Connection · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer