FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

Zihao Ye; Lequn Chen; Ruihang Lai; Wuwei Lin; Yineng Zhang; Stephanie Wang; Tianqi Chen; Baris Kasikci; Vinod Grover; Arvind Krishnamurthy; Luis Ceze

arXiv:2501.01005·cs.DC·April 23, 2025·2 cites

Open Access · 1 Repo

TL;DR

FlashInfer is a customizable, efficient attention engine for large language model inference that optimizes memory access, adapts to various settings via JIT compilation, and significantly reduces latency in diverse scenarios.

Contribution

It introduces a flexible attention engine with block-sparse formats, JIT customization, and load-balanced scheduling, integrated into leading LLM serving frameworks.

Findings

01 Achieves 29-69% inter-token-latency reduction.

02 Reduces latency by 28-30% for long-context inference.

03 Provides 13-17% speedup in parallel generation scenarios.

Abstract

Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand flexible and high-performance attention solutions. We present FlashInfer: a customizable and efficient attention engine for LLM serving. FlashInfer tackles KV-cache storage heterogeneity using a block-sparse format and composable formats to optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation. Additionally, FlashInfer's load-balanced scheduling algorithm adapts to the dynamism of user requests while maintaining compatibility with CUDAGraph, which requires static configuration. FlashInfer has been integrated into leading LLM…
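To make the block-sparse KV-cache idea in the abstract more concrete, here is a minimal pure-Python sketch of a CSR-style index over a shared pool of fixed-size KV pages. It is an illustration under stated assumptions, not FlashInfer's API: the class `PagedKVIndex`, its field names, and the flat slot numbering are all hypothetical, and the real engine runs on GPU tensors rather than Python lists.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PagedKVIndex:
    """Hypothetical CSR-style index over a shared pool of fixed-size KV pages."""
    page_size: int
    indptr: List[int]         # request i owns indices[indptr[i]:indptr[i + 1]]
    indices: List[int]        # page ids in the shared pool, in sequence order
    last_page_len: List[int]  # number of valid slots in each request's final page

    def num_requests(self) -> int:
        return len(self.indptr) - 1

    def seq_len(self, req: int) -> int:
        """Number of cached tokens for one request."""
        n_pages = self.indptr[req + 1] - self.indptr[req]
        if n_pages == 0:
            return 0
        return (n_pages - 1) * self.page_size + self.last_page_len[req]

    def slots(self, req: int) -> List[int]:
        """Flat pool slots (page_id * page_size + offset) holding the request's KV entries."""
        pages = self.indices[self.indptr[req]:self.indptr[req + 1]]
        out: List[int] = []
        for i, page in enumerate(pages):
            # Only the final page of a request may be partially filled.
            valid = self.last_page_len[req] if i == len(pages) - 1 else self.page_size
            out.extend(page * self.page_size + off for off in range(valid))
        return out


# Two requests sharing one pool with 4-slot pages (illustrative values).
index = PagedKVIndex(
    page_size=4,
    indptr=[0, 2, 3],
    indices=[5, 1, 2],
    last_page_len=[3, 2],
)
```

In this sketch, request 0 spans pages 5 and 1 (seven tokens) and request 1 uses the first two slots of page 2. Each request's pages can sit anywhere in the pool, so sequences of different lengths share storage without padding, which is the kind of heterogeneity a block-sparse layout is meant to absorb.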

Code & Models

Repositories

flashinfer-ai/flashinfer (PyTorch, official)

Taxonomy

Topics: Data Quality and Management · Scientific Computing and Data Management · Digital and Cyber Forensics

Methods: Softmax · Attention Is All You Need