PackInfer: Compute- and I/O-Efficient Attention for Batched LLM Inference

Rui Ning; Wei Zhang; Fan Lai

arXiv:2602.06072 · cs.DC · February 9, 2026


Open Access

TL;DR

PackInfer is a kernel-level attention framework that makes large language model inference more efficient by balancing compute and I/O across batched requests with heterogeneous sequence lengths, reducing latency and increasing throughput.

Contribution

It introduces a compute- and I/O-aware execution method that packs heterogeneous requests into load-balanced execution groups and reorganizes data to improve GPU utilization during batched LLM inference.

Findings

01. Reduces inference latency by 13.0-20.1%.

02. Improves throughput by 20% over FlashAttention.

03. Effectively balances GPU workload for heterogeneous request batches.

Abstract

Attention efficiency is critical to large language model (LLM) inference. While prior advances optimize attention execution for individual requests (e.g., FlashAttention), production LLM serving relies on batching requests with highly heterogeneous sequence lengths for high serving throughput. This mismatch induces severe computation and I/O imbalance, exacerbates stragglers, and underutilizes GPU resources. We present PackInfer, a kernel-level attention framework that enables compute- and I/O-aware execution for heterogeneous batched inference. PackInfer orchestrates batched requests into load-balanced execution groups, effectively saturating GPU utilization by packing multiple requests into unified kernel launches. By constructing attention kernels directly over packed query-key regions, PackInfer eliminates redundant computation and balances thread-block execution. It then…
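To make the idea of "load-balanced execution groups" concrete, here is a minimal sketch of one way to group requests of different lengths. It is not the authors' algorithm: it uses a generic greedy longest-first heuristic, treats token count as a stand-in for per-request cost (an assumption; real attention cost depends on query and key lengths), and the names `pack_requests` and `group_loads` are hypothetical.

```python
import heapq


def pack_requests(seq_lens, num_groups):
    """Greedy longest-first packing (illustrative, not PackInfer's method).

    Each request goes to the group with the smallest accumulated load,
    where load is approximated by the total number of tokens.
    Returns a list of groups, each a list of request indices.
    """
    # Min-heap of (current_load, group_id) so the lightest group pops first.
    heap = [(0, g) for g in range(num_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(num_groups)]

    # Place long requests first; short ones then fill the remaining gaps.
    order = sorted(range(len(seq_lens)), key=lambda i: -seq_lens[i])
    for idx in order:
        load, g = heapq.heappop(heap)
        groups[g].append(idx)
        heapq.heappush(heap, (load + seq_lens[idx], g))
    return groups


def group_loads(seq_lens, groups):
    """Total token count assigned to each group."""
    return [sum(seq_lens[i] for i in grp) for grp in groups]


# Hypothetical batch with highly heterogeneous sequence lengths.
seq_lens = [4096, 128, 256, 2048, 512, 1024, 64, 3072]
groups = pack_requests(seq_lens, num_groups=2)
loads = group_loads(seq_lens, groups)
print(groups, loads)
```

With this heuristic the gap between the heaviest and lightest group never exceeds the length of the longest single request, which is the kind of straggler reduction the abstract describes. How PackInfer actually forms its groups and maps them onto kernel launches is specified in the paper, not here.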

Taxonomy

Topics: Machine Learning in Materials Science · Parallel Computing and Optimization Techniques · Advanced Neural Network Applications