Libra: Unleashing GPU Heterogeneity for High-Performance Sparse Matrix Multiplication

Jinliang Shi; Shigang Li; Youxuan Xu; Xueying Wang; Rongtian Fu; Zhi Ma; Tong Wu

arXiv:2506.22714 · cs.DC · December 23, 2025

TL;DR

Libra is a framework that combines Tensor Core Units (TCUs) and CUDA cores to accelerate sparse matrix multiplication (SpMM and SDDMM), reporting up to 2.9x speedup over state-of-the-art baselines for deep learning and scientific computing workloads.

Contribution

This work introduces Libra, a novel framework that systematically leverages GPU heterogeneity for high-performance sparse matrix multiplication, combining workload distribution, load balancing, and efficient kernel design.

Findings

01. Achieves up to 2.9x speedup over state-of-the-art baselines.

02. Effectively accelerates end-to-end GNN applications.

03. Demonstrates significant performance improvements on H100 and RTX 4090 GPUs.

Abstract

Sparse matrix multiplication operators (i.e., SpMM and SDDMM) are widely used in deep learning and scientific computing. Modern accelerators are commonly equipped with Tensor Core Units (TCUs) and CUDA cores to accelerate sparse operators. The former excels at structured matrix computations, whereas the latter offers greater programming flexibility. However, how to combine these two resources to maximize sparse-operator performance remains unclear. In this work, we first identify the source of performance gains in hybrid computation and systematically analyze their complementary strengths. Motivated by this, we propose Libra, a holistic framework that efficiently leverages heterogeneous computing resources to accelerate both SpMM and SDDMM operators. Specifically, Libra introduces a 2D-aware (locality and utilization) workload distribution method to precisely identify the optimal task…
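For readers unfamiliar with the two operators named in the abstract, the following is a minimal reference sketch of what SpMM and SDDMM compute. It is not Libra's implementation and says nothing about how Libra splits work between TCUs and CUDA cores. It assumes a CSR layout for the sparse matrix and assumes the SDDMM variant that scales each sampled dot product by the stored sparse value (some definitions only sample). The function names `spmm` and `sddmm` and the example inputs are illustrative.

```python
# Reference (unoptimized) semantics of SpMM and SDDMM in pure Python.
# Assumed CSR layout: the nonzeros of row i are data[indptr[i]:indptr[i+1]],
# and their column indices are indices[indptr[i]:indptr[i+1]].

def spmm(indptr, indices, data, B):
    """SpMM: C = A @ B, with sparse A in CSR form and dense B (list of rows)."""
    n_rows = len(indptr) - 1
    n_cols = len(B[0])
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for p in range(indptr[i], indptr[i + 1]):
            k, a = indices[p], data[p]
            for j in range(n_cols):
                C[i][j] += a * B[k][j]
    return C


def sddmm(indptr, indices, data, X, Y):
    """SDDMM: out[p] = S[i, j] * dot(X[i], Y[j]) for each nonzero (i, j) of S.

    Only positions where S is nonzero are computed; the result reuses
    the sparsity pattern (indptr, indices) of S.
    """
    out = [0.0] * len(data)
    for i in range(len(indptr) - 1):
        for p in range(indptr[i], indptr[i + 1]):
            j = indices[p]
            dot = sum(x * y for x, y in zip(X[i], Y[j]))
            out[p] = data[p] * dot
    return out


# Illustrative 2x3 sparse matrix [[1, 0, 2], [0, 0, 3]] in CSR form.
indptr = [0, 2, 3]
indices = [0, 2, 2]
data = [1.0, 2.0, 3.0]

B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # dense 3x2
X = [[1.0, 2.0], [3.0, 4.0]]               # dense 2x2, one row per sparse row
Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # dense 3x2, one row per sparse column

spmm_result = spmm(indptr, indices, data, B)
sddmm_result = sddmm(indptr, indices, data, X, Y)
```

In both loops the inner work per nonzero is a short dense vector operation. Dense clusters of nonzeros map naturally onto matrix units, while scattered nonzeros are easier to handle with scalar code, which is the kind of trade-off the abstract refers to when contrasting TCUs with CUDA cores.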

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Graph Theory and Algorithms · Ferroelectric and Negative Capacitance Devices