Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression

Michael R. Metel; Boxing Chen; Mehdi Rezagholizadeh

arXiv:2412.05693·cs.CL·July 4, 2025

TL;DR

This paper introduces Batch-Max, a method that compresses KV caches during input processing to enable larger batch sizes and significantly improve LLM inference throughput without sacrificing accuracy.

Contribution

The paper proposes compressing the KV cache during input processing, not only after the prompt has been processed, which allows larger batch sizes and higher throughput when GPU memory is limited.
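As a rough illustration of what compressing during input processing can look like, the toy sketch below feeds a prompt through in chunks and evicts the lowest-scoring cached positions whenever the cache exceeds a cap. Everything here is an assumption made for illustration: the function name `prefill_with_eviction`, the chunked processing, the `scores` importance values, and the top-k eviction rule are placeholders, not the eviction policy used in the paper.

```python
def prefill_with_eviction(scores, chunk_size, cache_cap):
    """Toy chunked prefill with a capped cache.

    scores: hypothetical per-token importance values (placeholder for
            whatever signal a real eviction policy would use).
    Returns the kept token positions and the peak cache size observed.
    """
    cache = []  # token positions currently held in the cache
    peak = 0
    for start in range(0, len(scores), chunk_size):
        end = min(start + chunk_size, len(scores))
        cache.extend(range(start, end))  # append this chunk's positions
        peak = max(peak, len(cache))
        if len(cache) > cache_cap:
            # Keep the cache_cap highest-scoring positions, in original order.
            best = sorted(cache, key=lambda i: scores[i], reverse=True)
            cache = sorted(best[:cache_cap])
    return cache, peak


# Hypothetical prompt of 100 tokens with made-up importance scores.
toy_scores = [float(i % 7) for i in range(100)]
kept, peak = prefill_with_eviction(toy_scores, chunk_size=10, cache_cap=20)
print(len(kept), peak)
```

In this toy setup the cache never holds more than `cache_cap + chunk_size` positions, whereas compressing only after input processing would require holding the full prompt at its peak.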

Findings

01. Enabling KV cache compression during input processing increases throughput.

02. Larger batch sizes are feasible with the proposed method.

03. Model accuracy is maintained despite compression.

Abstract

Several works have developed eviction policies to remove key-value (KV) pairs from the KV cache for more efficient inference. The focus has been on compressing the KV cache after the input prompt has been processed, for faster token generation. In settings with limited GPU memory, and when the input context is longer than the generation length, we show that by also compressing the KV cache during the input processing phase, larger batch sizes can be used, resulting in significantly higher throughput while still maintaining the original model's accuracy.
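A back-of-envelope memory calculation shows why capping the cache during input processing admits larger batches. The sketch below is illustrative only: the model shape, memory budget, prompt and generation lengths, and per-sequence cache cap are all assumed values rather than figures from the paper, and the calculation ignores model weights, activations, and any transient per-chunk overhead.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache for one token: a key and a value tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_batch_size(budget_bytes, tokens_cached_per_seq, per_token_bytes):
    """Largest batch whose peak KV cache fits in the given memory budget."""
    return budget_bytes // (tokens_cached_per_seq * per_token_bytes)


# Hypothetical model shape and workload (assumptions, not from the paper).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
budget = 8 * 1024**3              # assumed 8 GiB left over for the KV cache
prompt_len, gen_len = 4096, 128   # input context longer than generation
cache_cap = 1024                  # assumed per-sequence cap when compressing

# Compressing only after input processing: the peak holds the whole prompt.
batch_uncompressed = max_batch_size(budget, prompt_len + gen_len, per_token)

# Compressing during input processing: the peak is bounded by the cap.
batch_compressed = max_batch_size(budget, cache_cap, per_token)

print(batch_uncompressed, batch_compressed)
```

Under these assumed numbers the peak per-sequence cache, not the final one, determines how many sequences fit, which is why the gain appears when the input context is much longer than the generation length.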

Taxonomy

Topics: Algorithms and Data Compression · Advanced Data Storage Technologies · Network Packet Processing and Optimization