Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference
Saksham Rathi, Preeti, Mythili Vutukuru

TL;DR
This paper introduces Feather, a prefix-aware scheduler that uses reinforcement learning and a new data structure, the Chunked Hash Tree, to optimize batch formation in large language model inference. It reports 2-10x higher throughput than existing schedulers.
Contribution
Feather employs RL and a novel Chunked Hash Tree to optimize batch formation based on prefix homogeneity, outperforming existing schedulers.
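To make "prefix homogeneity" concrete, the sketch below groups requests whose leading token chunks are identical by comparing chained chunk hashes. It is a minimal illustration only: this summary does not describe the Chunked Hash Tree's internals, so the names `chunk_hashes` and `group_by_prefix`, the chunk size, and the flat dictionary layout are all assumptions rather than the authors' design.

```python
import hashlib
from collections import defaultdict


def chunk_hashes(tokens, chunk_size=4):
    """Return one chained hash per full chunk of `tokens` (hypothetical helper).

    Hash i covers chunks 0..i, so two requests share hash i only if their
    first (i + 1) * chunk_size tokens are identical (barring collisions).
    """
    hashes, prev = [], b""
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = list(tokens[start:start + chunk_size])
        prev = hashlib.sha256(prev + repr(chunk).encode()).digest()
        hashes.append(prev)
    return hashes


def group_by_prefix(requests, depth, chunk_size=4):
    """Group request ids whose first `depth` chunks match (hypothetical helper).

    Requests shorter than `depth` chunks fall into a single group keyed by None.
    """
    groups = defaultdict(list)
    for req_id, tokens in requests.items():
        hashes = chunk_hashes(tokens, chunk_size)
        key = hashes[depth - 1] if len(hashes) >= depth else None
        groups[key].append(req_id)
    return dict(groups)


# Toy token ids: "a" and "b" share their first 8 tokens, "c" does not.
requests = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "b": [1, 2, 3, 4, 5, 6, 7, 8, 10],
    "c": [7, 7, 7, 7, 1, 2, 3, 4],
}
groups = group_by_prefix(requests, depth=2)
batches = sorted(sorted(ids) for ids in groups.values())
```

Each resulting group is a candidate prefix-homogeneous batch. How Feather decides where to stop growing a batch is the job of its RL policy, which this sketch does not model.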
Findings
Feather achieves 2-10x higher throughput compared to existing schedulers.
It reduces KV cache accesses, surpassing prefix-aware attention kernels.
Performance gains are maintained even with workloads lacking prefix sharing.
Abstract
Auto-regressive token generation in large language models is memory-bound because it requires "attending to" key and value tensors (KV cache) of all previous tokens. Prior work aims to improve the efficiency of this decode process by batching multiple requests together, and maximizing batch size subject to GPU memory constraints. The key observation of our work is that with prefix-sharing workloads, smaller, prefix-homogeneous batches -- where all requests share a common prefix -- can achieve higher decode throughput than larger, heterogeneous batches, due to better spatial and temporal locality during KV cache accesses. However, prefix-aware schedulers in state-of-the-art inference engines maximize prefix reuse within a batch only to reduce KV cache memory footprint, but do not stop batch formation at smaller homogeneous batches that could have performed better. Further, we show that…
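One crude way to see why a smaller homogeneous batch can have better locality is to count the distinct KV cache entries that one decode step touches. The sketch below does this under strong assumptions that are not from the paper: each request is a hypothetical `(prefix_id, prefix_len, suffix_len)` tuple, a shared prefix is stored once, and the number of distinct entries stands in for the working set. It is not the authors' analysis and does not predict throughput.

```python
def distinct_kv_tokens(batch):
    """Count distinct KV-cache token entries touched in one decode step.

    `batch` is a list of (prefix_id, prefix_len, suffix_len) tuples, all
    hypothetical. A shared prefix is counted once per distinct prefix_id;
    suffix tokens are private to each request.
    """
    prefix_lens = {}
    suffix_total = 0
    for prefix_id, prefix_len, suffix_len in batch:
        prefix_lens[prefix_id] = prefix_len
        suffix_total += suffix_len
    return sum(prefix_lens.values()) + suffix_total


# 8 requests that all share one 1000-token prefix.
homogeneous = [("p0", 1000, 10)] * 8
# 16 requests spread over 4 different 1000-token prefixes.
heterogeneous = [(f"p{i % 4}", 1000, 10) for i in range(16)]

small_working_set = distinct_kv_tokens(homogeneous)
large_working_set = distinct_kv_tokens(heterogeneous)
```

With these toy numbers the homogeneous batch touches 1080 distinct entries and the larger heterogeneous batch touches 4160. That is the kind of gap that could matter for cache locality, although real behaviour depends on the attention kernel and the hardware.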
