Fused3S: Fast Sparse Attention on Tensor Cores

Zitong Li; Aparna Chandramowlishwaran

arXiv:2505.08098·cs.DC·May 14, 2025

TL;DR

Fused3S is a fused algorithm for the three sparse matrix operations that make up sparse attention (SDDMM, softmax, and SpMM). It accelerates sparse attention on GPUs by maximizing tensor core utilization and reducing data movement, which benefits graph neural network models.

Contribution

Fused3S is the first algorithm to jointly optimize the three sparse matrix operations in the 3S pattern, achieving substantial speedups over previous methods on modern GPUs.

Findings

01

Achieves up to 16.3x speedup on H100 GPUs.

02

Accelerates Graph Transformer inference by up to 5.36x.

03

Outperforms existing sparse operation methods across multiple datasets and GPU architectures.

Abstract

Sparse attention is a core building block in many leading neural network models, from graph-structured learning to sparse sequence modeling. It can be decomposed into a sequence of three sparse matrix operations (3S): sampled dense-dense matrix multiplication (SDDMM), softmax normalization, and sparse matrix multiplication (SpMM). Efficiently executing the 3S computational pattern on modern GPUs remains challenging due to (a) the mismatch between unstructured sparsity and tensor cores optimized for dense operations, and (b) the high cost of data movement. Previous works have optimized these sparse operations individually or addressed one of these challenges. This paper introduces Fused3S, the first fused 3S algorithm that jointly maximizes tensor core utilization and minimizes data movement. Across real-world graph datasets, Fused3S achieves $1.6\times$ to $16.3\times$ and $1.5\times$ to $14\times$ speedup…
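The 3S decomposition described in the abstract can be illustrated with a small unfused reference in plain Python. This is only a sketch of the computational pattern, not the Fused3S kernel: the function names (`sddmm`, `sparse_softmax`, `spmm`, `sparse_attention_3s`) are hypothetical, the sparsity mask is assumed to be stored in CSR form (`indptr`, `indices`), and any score scaling is omitted. Each stage materializes its full intermediate result, which is the kind of data movement a fused algorithm aims to avoid.

```python
import math

def sddmm(indptr, indices, Q, K):
    """SDDMM: compute Q[i] . K[j] only at the nonzero (i, j) positions."""
    vals = []
    for i in range(len(indptr) - 1):
        for j in indices[indptr[i]:indptr[i + 1]]:
            vals.append(sum(q * k for q, k in zip(Q[i], K[j])))
    return vals

def sparse_softmax(indptr, vals):
    """Row-wise softmax over the stored nonzeros (max-subtracted for stability)."""
    out = []
    for i in range(len(indptr) - 1):
        row = vals[indptr[i]:indptr[i + 1]]
        if not row:
            continue  # empty row contributes no values
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.extend(e / total for e in exps)
    return out

def spmm(indptr, indices, vals, V):
    """SpMM: multiply the sparse attention matrix by the dense matrix V."""
    d = len(V[0])
    out = [[0.0] * d for _ in range(len(indptr) - 1)]
    for i in range(len(indptr) - 1):
        for p in range(indptr[i], indptr[i + 1]):
            j = indices[p]
            for c in range(d):
                out[i][c] += vals[p] * V[j][c]
    return out

def sparse_attention_3s(indptr, indices, Q, K, V):
    """Unfused 3S pipeline: each stage writes out a full intermediate array."""
    scores = sddmm(indptr, indices, Q, K)    # one score per nonzero
    probs = sparse_softmax(indptr, scores)   # one probability per nonzero
    return spmm(indptr, indices, probs, V)   # dense output, one row per query
```

Calling `sparse_attention_3s(indptr, indices, Q, K, V)` with list-of-lists inputs returns one output row per query row; rows with no nonzeros in the mask come back as all zeros.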

Code & Models

Repositories

HPCForge/Fused3S
PyTorch · Official

Taxonomy

Methods: Attention Is All You Need · Laplacian EigenMap · Linear Layer · Laplacian Positional Encodings · Multi-Head Attention · Dense Connections · Graph Transformer · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer