FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference
Aniruddha Nrusimha, William Brandon, Mayank Mishra, Yikang Shen, Rameswar Panda, Jonathan Ragan-Kelley, and Yoon Kim

TL;DR
FlashFormer fuses the entire transformer forward pass into a single kernel to accelerate low-batch inference for large language models, a setting that matters for edge deployment and latency-sensitive applications.
Contribution
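The paper presents FlashFormer, a single kernel covering the whole transformer forward pass. It targets low-batch inference, where memory bandwidth and kernel launch overheads are significant, whereas existing kernels primarily optimize compute utilization for large-batch training and inference.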
Findings
Achieves speedups over existing inference kernels across various model sizes.
Effective in different quantization settings for large language models.
Reduces latency and improves efficiency in low-batch inference scenarios.
Abstract
The size and compute characteristics of modern large language models have led to an increased interest in developing specialized kernels tailored for particular training and inference workloads. Existing kernels primarily optimize for compute utilization, targeting the large-batch training and inference settings. However, low-batch inference, where memory bandwidth and kernel launch overheads are significant factors, remains important for many applications of interest, such as edge deployment and latency-sensitive applications. This paper describes FlashFormer, which fuses the entire transformer forward pass into a single kernel for accelerating low-batch inference of large language models. Across various model sizes and quantization settings, FlashFormer achieves nontrivial speedups compared to existing inference kernels.
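To make the abstract's point about memory bandwidth and kernel launch overheads concrete, here is a rough back-of-envelope sketch of per-token decode latency at batch size 1. It is not taken from the paper: the function name `decode_latency_ms`, the additive cost model, and every number in the usage lines (parameter count, bandwidth, launch count, per-launch overhead) are illustrative assumptions rather than measurements.

```python
def decode_latency_ms(param_count, bytes_per_param, bandwidth_gb_s,
                      num_launches, launch_overhead_us):
    """Crude per-token latency estimate for batch-size-1 decoding.

    Assumes (hypothetically) that every weight is read from memory once
    per token and that kernel launch overheads simply add on top, with no
    overlap between the two.
    """
    weight_bytes = param_count * bytes_per_param
    # Time to stream all weights once at the given memory bandwidth.
    stream_ms = weight_bytes / (bandwidth_gb_s * 1e9) * 1e3
    # Total launch overhead, converted from microseconds to milliseconds.
    launch_ms = num_launches * launch_overhead_us * 1e-3
    return stream_ms + launch_ms


# Hypothetical configuration: 8B parameters, 2000 GB/s memory bandwidth,
# 320 kernel launches per token at 5 microseconds each.
unfused_fp16 = decode_latency_ms(8e9, 2.0, 2000, 320, 5)
fused_fp16 = decode_latency_ms(8e9, 2.0, 2000, 1, 5)

# With 4-bit weights (0.5 bytes per parameter) the streaming term shrinks,
# so the same launch overhead becomes a larger share of the total.
unfused_int4 = decode_latency_ms(8e9, 0.5, 2000, 320, 5)
fused_int4 = decode_latency_ms(8e9, 0.5, 2000, 1, 5)
```

Under these assumed numbers, collapsing the launches into one removes a fixed cost per token, and that cost weighs more heavily as quantization reduces the bytes streamed. Real behavior depends on overlap, caching, and synchronization effects that this model ignores.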
Taxonomy
Topics: Advanced Neural Network Applications · Natural Language Processing Techniques · Speech Recognition and Synthesis
