SPLAT: A framework for optimised GPU code-generation for SParse reguLar ATtention

Ahan Gupta; Yueming Yuan; Devansh Jain; Yuhao Ge; David Aponte; Yanqi Zhou; Charith Mendis

arXiv:2407.16847·cs.PL·July 25, 2024

TL;DR

SPLAT introduces a novel GPU code-generation framework that efficiently supports diverse sparse multi-head self-attention patterns, significantly improving inference speed for NLP and vision tasks.

Contribution

It proposes a new affine-compressed-sparse-row format and a code-generation scheme that enable high-performance, general sparse-MHSA implementations on GPUs.
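
This summary does not describe how the affine-compressed-sparse-row format is laid out, so the following is only a hypothetical sketch of the general idea its name suggests: when the non-zero columns of a row form an arithmetic progression, the per-element column indices of ordinary CSR can be replaced by a small per-row affine description. The `(start, stride, count)` triple and all function names here are assumptions made for illustration, not the paper's data structure.

```python
from typing import List, Optional, Tuple

# Hypothetical per-row descriptor (start, stride, count); NOT the paper's layout.
AffineRow = Tuple[int, int, int]


def compress_row(cols: List[int]) -> Optional[AffineRow]:
    """Return (start, stride, count) if cols is an arithmetic progression, else None."""
    if not cols:
        return (0, 1, 0)
    if len(cols) == 1:
        return (cols[0], 1, 1)
    stride = cols[1] - cols[0]
    if stride <= 0:
        return None
    for a, b in zip(cols, cols[1:]):
        if b - a != stride:
            return None  # irregular row: would need explicit indices as in plain CSR
    return (cols[0], stride, len(cols))


def expand_row(row: AffineRow) -> List[int]:
    """Recover column indices from the affine map col = start + stride * k."""
    start, stride, count = row
    return [start + stride * k for k in range(count)]


def window_mask_cols(n: int, w: int) -> List[List[int]]:
    """Column indices of a sliding-window mask: row i keeps columns j with |i - j| <= w."""
    return [list(range(max(0, i - w), min(n, i + w + 1))) for i in range(n)]


# A regular (windowed) pattern compresses to three integers per row.
rows = window_mask_cols(n=8, w=1)
compressed = [compress_row(r) for r in rows]
```

Under these assumptions, every row of a windowed or strided mask needs constant metadata regardless of how many non-zeros it holds, whereas a row with irregular gaps returns `None` and would fall back to explicit indices.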

Findings

01. Achieves 2.05x speedup over Triton kernels.

02. Achieves 4.05x speedup over TVM kernels.

03. Supports diverse sparse-MHSA patterns with high efficiency.

Abstract

Multi-head-self-attention (MHSA) mechanisms achieve state-of-the-art (SOTA) performance across natural language processing and vision tasks. However, their quadratic dependence on sequence lengths has bottlenecked inference speeds. To circumvent this bottleneck, researchers have proposed various sparse-MHSA models, where a subset of full attention is computed. Despite their promise, current sparse libraries and compilers do not support high-performance implementations for diverse sparse-MHSA patterns due to the underlying sparse formats they operate on. These formats, which are typically designed for high-performance & scientific computing applications, are either curated for extreme amounts of random sparsity (<1% non-zero values), or specific sparsity patterns. However, the sparsity patterns in sparse-MHSA are moderately sparse (10-50% non-zero values) and varied, resulting in…
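
To make "a subset of full attention is computed" concrete, here is a minimal pure-Python sketch of single-head attention restricted to a sliding-window mask. The window pattern, the function names, and the sizes are assumptions chosen for illustration; the abstract does not single out this pattern, and this is reference code, not a GPU kernel.

```python
import math
from typing import List

Matrix = List[List[float]]


def window_mask(n: int, w: int) -> List[List[bool]]:
    """Row i may attend to column j only when |i - j| <= w."""
    return [[abs(i - j) <= w for j in range(n)] for i in range(n)]


def density(mask: List[List[bool]]) -> float:
    """Fraction of attention entries that are kept (non-zero)."""
    kept = sum(sum(row) for row in mask)
    return kept / (len(mask) * len(mask[0]))


def sparse_attention(q: Matrix, k: Matrix, v: Matrix, mask: List[List[bool]]) -> Matrix:
    """Softmax attention that evaluates scores only at kept positions.

    Assumes every row of the mask keeps at least one column.
    """
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        cols = [j for j, keep in enumerate(mask[i]) if keep]
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in cols]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(wt * v[j][c] for wt, j in zip(weights, cols)) / total
            for c in range(len(v[0]))
        ])
    return out


mask = window_mask(16, 2)
kept_fraction = density(mask)  # 74 of 256 entries are kept
```

With 16 tokens and a window of 2 on each side, about 29% of the attention matrix is kept, which sits inside the 10-50% range the abstract calls moderately sparse.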

Taxonomy

Methods: Softmax · Attention Is All You Need