SPLAT: A framework for optimised GPU code-generation for SParse reguLar ATtention
Ahan Gupta, Yueming Yuan, Devansh Jain, Yuhao Ge, David Aponte, Yanqi Zhou, Charith Mendis

TL;DR
SPLAT introduces a novel GPU code-generation framework that efficiently supports diverse sparse multi-head self-attention patterns, significantly improving inference speed for NLP and vision tasks.
Contribution
It proposes a new affine-compressed-sparse-row format and a code-generation scheme that enable high-performance, general sparse-MHSA implementations on GPUs.
Findings
Achieves 2.05x speedup over Triton kernels.
Achieves 4.05x speedup over TVM kernels.
Supports diverse sparse-MHSA patterns with high efficiency.
Abstract
Multi-head-self-attention (MHSA) mechanisms achieve state-of-the-art (SOTA) performance across natural language processing and vision tasks. However, their quadratic dependence on sequence lengths has bottlenecked inference speeds. To circumvent this bottleneck, researchers have proposed various sparse-MHSA models, where a subset of full attention is computed. Despite their promise, current sparse libraries and compilers do not support high-performance implementations for diverse sparse-MHSA patterns due to the underlying sparse formats they operate on. These formats, which are typically designed for high-performance & scientific computing applications, are either curated for extreme amounts of random sparsity (<1% non-zero values), or specific sparsity patterns. However, the sparsity patterns in sparse-MHSA are moderately sparse (10-50% non-zero values) and varied, resulting in…
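To make the "moderately sparse" regime in the abstract concrete, here is a minimal sketch in plain Python. It is not SPLAT's code generator or its sparse format. It assumes one common regular pattern, a sliding window, and the helper names (`window_mask`, `density`, `sparse_attention`) are hypothetical, chosen only for illustration. It shows how a mask selects a subset of full attention and how the fraction of computed entries is measured.

```python
import math

def window_mask(seq_len, window):
    """Banded mask: token i attends to token j only if |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]

def density(mask):
    """Fraction of the full attention matrix that is actually computed."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return kept / total

def sparse_attention(q, k, v, mask):
    """Single-head attention that scores only the (i, j) pairs kept by mask.

    q, k, v are lists of equal-length float vectors, one per token.
    Every row of mask must keep at least one column.
    """
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        cols = [j for j, keep in enumerate(mask[i]) if keep]
        # Scaled dot-product scores over the kept columns only.
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d)
                  for j in cols]
        # Numerically stable softmax restricted to the kept columns.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[j][c] for w, j in zip(weights, cols))
                    for c in range(len(v[0]))])
    return out
```

With `window_mask(64, 8)`, `density` reports that roughly 25% of the entries are kept, which falls inside the 10-50% range the abstract describes and is far denser than the <1% regime that formats built for scientific computing target.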
Taxonomy
Methods: Softmax · Attention Is All You Need
