FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference

Aniruddha Nrusimha, William Brandon, Mayank Mishra, Yikang Shen, Rameswar Panda, Jonathan Ragan-Kelley, and Yoon Kim

arXiv:2505.22758 · cs.LG · December 5, 2025

Open Access

TL;DR

FlashFormer fuses the entire transformer forward pass into a single kernel to accelerate low-batch inference of large language models, a setting that matters for edge deployment and latency-sensitive applications.

Contribution

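The paper presents FlashFormer, a kernel that fuses the entire transformer forward pass to speed up low-batch inference. Existing kernels primarily optimize compute utilization for large-batch training and inference, whereas in the low-batch setting memory bandwidth and kernel launch overheads are significant factors.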

Findings

01

Achieves speedups over existing inference kernels across various model sizes.

02

Effective in different quantization settings for large language models.

03

Reduces latency in low-batch inference, where memory bandwidth and kernel launch overheads are significant factors.

Abstract

The size and compute characteristics of modern large language models have led to increased interest in developing specialized kernels tailored for particular training and inference workloads. Existing kernels primarily optimize for compute utilization, targeting the large-batch training and inference settings. However, low-batch inference, where memory bandwidth and kernel launch overheads are significant factors, remains important for many applications of interest, such as edge deployment and latency-sensitive applications. This paper describes FlashFormer, which fuses the entire transformer forward pass into a single kernel for accelerating low-batch inference of large language models. Across various model sizes and quantization settings, FlashFormer achieves nontrivial speedups compared to existing inference kernels.
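
To make the abstract's point about memory bandwidth and kernel launch overheads concrete, here is a rough back-of-envelope sketch. It is not taken from the paper. It assumes that a low-batch decode step must stream every weight from memory once and that each kernel launch adds a fixed cost. The function name and all numbers (8e9 parameters, 1000 GB/s bandwidth, 300 kernels per step, 5 microseconds per launch) are hypothetical, chosen only for illustration.

```python
def decode_step_latency_us(param_count, bytes_per_param, bandwidth_gb_s,
                           num_kernels, launch_overhead_us):
    """Crude latency estimate for one low-batch decode step, in microseconds.

    Assumes the step is memory-bound (all weights are read once) and that
    launch overhead is not hidden behind other work. It ignores compute time,
    KV-cache traffic, and synchronization. These are simplifying assumptions,
    not measurements.
    """
    weight_bytes = param_count * bytes_per_param
    bytes_per_us = bandwidth_gb_s * 1e3  # 1 GB/s = 1e9 B/s = 1e3 B/us
    memory_us = weight_bytes / bytes_per_us
    launch_us = num_kernels * launch_overhead_us
    return memory_us + launch_us


# Hypothetical configuration (illustrative values, not from the paper).
PARAMS = 8e9
BANDWIDTH_GB_S = 1000.0
KERNELS_UNFUSED = 300
LAUNCH_US = 5.0

for label, bytes_per_param in [("16-bit", 2.0), ("4-bit", 0.5)]:
    unfused = decode_step_latency_us(PARAMS, bytes_per_param, BANDWIDTH_GB_S,
                                     KERNELS_UNFUSED, LAUNCH_US)
    single = decode_step_latency_us(PARAMS, bytes_per_param, BANDWIDTH_GB_S,
                                    1, LAUNCH_US)
    print(f"{label}: many kernels {unfused:.0f} us, single kernel {single:.0f} us")
```

Under these made-up numbers, the launch term grows from roughly 9% of the step at 16 bits per weight to roughly 27% at 4 bits, since quantization shrinks the memory term but not the launch term. This toy model captures only the launch-count effect of using a single kernel and says nothing about the speedups FlashFormer actually measures.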

Taxonomy

Topics: Advanced Neural Network Applications · Natural Language Processing Techniques · Speech Recognition and Synthesis