Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression
Michael R. Metel, Boxing Chen, Mehdi Rezagholizadeh

TL;DR
This paper introduces Batch-Max, a method that compresses KV caches during input processing to enable larger batch sizes and significantly improve LLM inference throughput without sacrificing accuracy.
Contribution
It proposes compressing the KV cache during input processing, not only after the prompt has been processed, which allows larger batch sizes and higher throughput when GPU memory is limited.
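As a rough illustration of the idea, and not the authors' implementation, the sketch below processes a prompt in chunks and evicts low-scoring entries whenever the cache exceeds a budget. The function name `compress_during_prefill`, the fixed per-token scores, and the keep-the-top-scores rule are all assumptions made for this example; real eviction policies typically derive scores from attention statistics.

```python
def compress_during_prefill(token_scores, chunk_size, budget):
    """Hypothetical sketch: bound the cache while the prompt is processed.

    token_scores: one importance score per prompt token (assumed given).
    chunk_size:   number of prompt tokens processed per step.
    budget:       maximum number of cache entries kept after each step.
    Returns (kept_positions, peak_cache_entries).
    """
    kept = []  # list of (position, score) pairs currently in the cache
    peak = 0
    for start in range(0, len(token_scores), chunk_size):
        end = min(start + chunk_size, len(token_scores))
        # Processing a chunk appends its entries to the cache.
        kept.extend((pos, token_scores[pos]) for pos in range(start, end))
        peak = max(peak, len(kept))
        if len(kept) > budget:
            # Evict: keep only the highest-scoring entries (illustrative policy).
            kept = sorted(kept, key=lambda e: e[1], reverse=True)[:budget]
            kept.sort()  # restore positional order
    return [pos for pos, _ in kept], peak


scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6]
positions, peak = compress_during_prefill(scores, chunk_size=2, budget=4)
print(positions, peak)
```

In this toy setting the cache never holds more than `budget + chunk_size` entries, whereas compressing only after input processing would require holding all eight entries first.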
Findings
Enabling KV cache compression during input processing increases throughput.
Larger batch sizes are feasible with the proposed method.
Model accuracy is maintained despite compression.
Abstract
Several works have developed eviction policies to remove key-value (KV) pairs from the KV cache for more efficient inference. The focus has been on compressing the KV cache after the input prompt has been processed for faster token generation. In settings with limited GPU memory, and when the input context is longer than the generation length, we show that by also compressing the KV cache during the input processing phase, larger batch sizes can be used, resulting in significantly higher throughput while still maintaining the original model's accuracy.
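The link between peak cache size and batch size can be made concrete with back-of-the-envelope arithmetic. The sketch below is an assumption-laden illustration, not a result from the paper: the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values), the 8 GiB cache budget, and the token counts are hypothetical, and the calculation ignores memory used by weights and activations.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies (keys and values, hence the factor 2)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_batch_size(memory_budget_bytes, peak_tokens_per_sequence, bytes_per_token):
    """Largest batch whose peak KV cache fits in the given budget."""
    return memory_budget_bytes // (peak_tokens_per_sequence * bytes_per_token)


# Hypothetical model shape and memory budget, chosen only for illustration.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
budget = 8 * 2**30  # 8 GiB reserved for the KV cache

# Without compression during input processing, the peak is the full prompt.
batch_full = max_batch_size(budget, peak_tokens_per_sequence=4096, bytes_per_token=per_token)
# With compression during input processing, the peak is capped (here at 1024 tokens).
batch_compressed = max_batch_size(budget, peak_tokens_per_sequence=1024, bytes_per_token=per_token)

print(per_token, batch_full, batch_compressed)
```

Under these assumptions, capping the per-sequence cache at a quarter of the prompt length lets four times as many sequences fit in the same memory, which is the mechanism the abstract points to for higher throughput.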
Taxonomy
Topics: Algorithms and Data Compression · Advanced Data Storage Technologies · Network Packet Processing and Optimization
