SparkAttention: High-Performance Multi-Head Attention for Large Models on Volta GPU Architecture

Youxuan Xu; Tong Wu; Shigang Li; Xueying Wang; Jingjing Wang

arXiv:2502.12784 · cs.DC · August 22, 2025

Open Access

TL;DR

SparkAttention is a specialized acceleration library that significantly speeds up Multi-Head Attention training on NVIDIA Volta GPUs by leveraging Tensor Cores and kernel fusion, reducing memory access overhead.

Contribution

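The paper introduces SparkAttention, a library that accelerates Transformer MHA training on Volta GPUs by targeting their Tensor Cores. Volta Tensor Cores support different computational shapes from other GPU architectures, so most prior acceleration work has not targeted them.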

Findings

01. Achieves up to 2.46x speedup over PyTorch

02. Average 1.80x acceleration on NVIDIA V100

03. Effectively reduces memory access overhead

Abstract

Transformers are widely used in various fields such as natural language processing and computer vision. However, training large Transformer models can be time-consuming because of the Multi-Head Attention (MHA) mechanism, and training becomes more costly as models grow, so it is crucial to utilize all available resources for efficient model training. NVIDIA Volta GPUs are still widely used. However, because the computational shapes supported by the Tensor Core Units (TCU) of Volta GPUs differ from those of other GPU architectures, most efforts have not focused on using them to accelerate Transformer training. To address this issue, we propose SparkAttention, an acceleration library designed to speed up MHA training on the Volta GPU. SparkAttention leverages TCU and kernel fusion to reduce the number of high bandwidth memory (HBM) accesses and overhead. Our End-to-End…
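To make the kernel-fusion idea in the abstract concrete, here is a minimal pure-Python sketch of why fusing attention can reduce memory traffic. It is not SparkAttention's kernel, and it says nothing about Volta Tensor Core shapes. It only illustrates the general online-softmax technique that fused attention kernels commonly rely on, for a single query vector. The function names `attend_reference` and `attend_streaming` and the `block` parameter are hypothetical, chosen for this illustration.

```python
import math

def attend_reference(q, keys, values):
    """Unfused reference: compute every score first, then softmax, then the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def attend_streaming(q, keys, values, block=4):
    """Fused-style sketch: visit keys/values block by block, keeping only a
    running max, a running softmax denominator, and an unnormalized output."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        kb = keys[start:start + block]
        vb = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in kb]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, vb))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same result up to floating-point rounding. The streaming version never holds the full list of scores, which is the property a fused GPU kernel can use to keep intermediates in on-chip memory instead of writing them to HBM.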

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Machine Learning and ELM