FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze

TL;DR
FlashInfer is a customizable, efficient attention engine for large language model inference that optimizes memory access, adapts to various settings via JIT compilation, and significantly reduces latency in diverse scenarios.
Contribution
It introduces a flexible attention engine with block-sparse formats, JIT customization, and load-balanced scheduling, integrated into leading LLM serving frameworks.
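To make "load-balanced scheduling" concrete, here is a minimal sketch of one way to spread variable-length requests over a fixed number of GPU thread blocks (CTAs). It is an illustration under assumptions, not the paper's algorithm: the function name `plan_work`, the chunk-size rule, and the greedy longest-first assignment are all hypothetical choices made for this example.

```python
import heapq
import math


def plan_work(kv_lens, num_ctas):
    """Assign KV-cache chunks to a fixed number of CTAs (hypothetical sketch).

    kv_lens[r] is the KV length of request r. The number of CTAs stays
    constant across calls; only the per-CTA work lists change.
    """
    # Assumed rule: cap each chunk at the average load per CTA.
    chunk = max(1, math.ceil(sum(kv_lens) / num_ctas))

    # Split every request into (request, start, end) tiles of at most `chunk` tokens.
    tiles = []
    for req, n in enumerate(kv_lens):
        for start in range(0, n, chunk):
            tiles.append((req, start, min(start + chunk, n)))

    # Greedy longest-first: always give the next tile to the least-loaded CTA.
    tiles.sort(key=lambda t: t[2] - t[1], reverse=True)
    heap = [(0, cta) for cta in range(num_ctas)]
    plan = [[] for _ in range(num_ctas)]
    for tile in tiles:
        load, cta = heapq.heappop(heap)
        plan[cta].append(tile)
        heapq.heappush(heap, (load + tile[2] - tile[1], cta))
    return plan


plan = plan_work([10, 3, 3], num_ctas=4)
for cta, work in enumerate(plan):
    print(cta, work)
```

Because the launch shape (the number of CTAs) does not depend on the request lengths, a plan like this can in principle be recomputed per batch while the kernel launch itself stays static.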
Findings
Achieves 29-69% inter-token-latency reduction.
Reduces latency by 28-30% for long-context inference.
Provides 13-17% speedup in parallel generation scenarios.
Abstract
Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand flexible and high-performance attention solutions. We present FlashInfer: a customizable and efficient attention engine for LLM serving. FlashInfer tackles KV-cache storage heterogeneity using a block-sparse format and composable formats to optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation. Additionally, FlashInfer's load-balanced scheduling algorithm adjusts to the dynamism of user requests while maintaining compatibility with CUDAGraph, which requires static configuration. FlashInfer has been integrated into leading LLM…
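The block-sparse view of the KV cache can be illustrated with a small pure-Python sketch. It assumes a CSR-style page table (`indptr`, `indices`) plus a `last_page_len` array; these names and the layout are assumptions made for this example, not FlashInfer's actual API, and the attention is a plain single-query softmax rather than an optimized kernel.

```python
import math


def paged_attention(q, k_pages, v_pages, indptr, indices, last_page_len, req):
    """Single-query attention for request `req` over a paged KV cache.

    k_pages / v_pages: list of pages, each a list of key / value vectors.
    indptr, indices:   CSR-style page table (assumed layout); request r owns
                       pages indices[indptr[r]:indptr[r + 1]].
    last_page_len[r]:  number of valid slots in request r's final page.
    """
    page_ids = indices[indptr[req]:indptr[req + 1]]

    # Gather only the pages this request references (the non-zero blocks).
    keys, values = [], []
    for n, pid in enumerate(page_ids):
        is_last = n == len(page_ids) - 1
        valid = last_page_len[req] if is_last else len(k_pages[pid])
        keys.extend(k_pages[pid][:valid])
        values.extend(v_pages[pid][:valid])

    # Numerically stable softmax over scaled dot-product scores.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


# Two requests share page 0 (a common prefix), so it is stored only once.
k_pages = [[[1, 0], [0, 1]], [[1, 1], [0, 0]], [[2, 0], [9, 9]]]
v_pages = [[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[8, 0], [100, 100]]]
indptr = [0, 2, 4]          # request 0 -> pages [0, 1], request 1 -> pages [0, 2]
indices = [0, 1, 0, 2]
last_page_len = [2, 1]      # request 1 uses only the first slot of page 2

print(paged_attention([0.0, 0.0], k_pages, v_pages, indptr, indices, last_page_len, req=1))
```

In this toy layout the shared prefix page appears in both requests' index lists but exists once in memory, which is the kind of redundancy reduction the abstract refers to.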
Taxonomy
Topics: Data Quality and Management · Scientific Computing and Data Management · Digital and Cyber Forensics
Methods: Softmax · Attention Is All You Need
