PackInfer: Compute- and I/O-Efficient Attention for Batched LLM Inference
Rui Ning, Wei Zhang, Fan Lai

TL;DR
PackInfer is a kernel-level attention framework that balances compute and I/O across batched requests with heterogeneous sequence lengths, reducing latency and increasing throughput in large language model inference.
Contribution
It introduces a compute- and I/O-aware execution method that packs requests into load-balanced execution groups and reorganizes data to improve GPU utilization during batched LLM inference.
Findings
Reduces inference latency by 13.0-20.1%.
Improves throughput by 20% over FlashAttention.
Effectively balances GPU workload for heterogeneous request batches.
Abstract
Attention efficiency is critical to large language model (LLM) inference. While prior advances optimize attention execution for individual requests (e.g., FlashAttention), production LLM serving relies on batching requests with highly heterogeneous sequence lengths for high serving throughput. This mismatch induces severe computation and I/O imbalance, exacerbates stragglers, and underutilizes GPU resources. We present PackInfer, a kernel-level attention framework that enables compute- and I/O-aware execution for heterogeneous batched inference. PackInfer orchestrates batched requests into load-balanced execution groups, effectively saturating GPU utilization by packing multiple requests into unified kernel launches. By constructing attention kernels directly over packed query-key regions, PackInfer eliminates redundant computation and balances thread-block execution. It then…
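To make the idea of load-balanced execution groups concrete, here is a minimal sketch of one way to group requests of very different lengths. It is not PackInfer's algorithm, which the abstract does not spell out. It assumes a simple greedy longest-first heuristic and uses sequence length as a stand-in for per-request attention cost. The names `pack_requests` and `group_loads` and the example lengths are hypothetical.

```python
import heapq

def pack_requests(seq_lens, num_groups):
    """Greedy longest-first packing (an illustrative heuristic, not PackInfer's).

    Each request goes to whichever group currently has the smallest total load.
    """
    # Min-heap of (current_load, group_index), so the lightest group pops first.
    heap = [(0, g) for g in range(num_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(num_groups)]
    # Place long requests first; short ones then fill the remaining gaps.
    for req_id in sorted(range(len(seq_lens)), key=lambda i: -seq_lens[i]):
        load, g = heapq.heappop(heap)
        groups[g].append(req_id)
        heapq.heappush(heap, (load + seq_lens[req_id], g))
    return groups

def group_loads(seq_lens, groups):
    """Total number of tokens assigned to each group."""
    return [sum(seq_lens[i] for i in grp) for grp in groups]

# Hypothetical batch with highly heterogeneous sequence lengths.
seq_lens = [4096, 128, 256, 2048, 512, 1024, 64, 64]

# Baseline: split the batch in arrival order, ignoring lengths.
naive_groups = [[0, 1, 2, 3], [4, 5, 6, 7]]
naive_loads = group_loads(seq_lens, naive_groups)

# Length-aware packing into the same number of groups.
groups = pack_requests(seq_lens, num_groups=2)
loads = group_loads(seq_lens, groups)

print("naive loads: ", naive_loads)   # [6528, 1664]
print("packed loads:", loads)         # [4096, 4096]
```

In this toy batch the arrival-order split leaves one group with roughly four times the work of the other, so the lighter group would finish early and idle. The length-aware grouping gives both groups the same token count. A real kernel-level scheduler would also have to account for memory traffic and thread-block granularity, which this sketch ignores.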
Taxonomy
Topics: Machine Learning in Materials Science · Parallel Computing and Optimization Techniques · Advanced Neural Network Applications
