Communication-Avoiding SpGEMM via Trident Partitioning on Hierarchical GPU Interconnects

Julian Bellavita; Lorenzo Pichetti; Thomas Pasquali; Flavio Vella; Giulia Guidi

arXiv:2603.21444 · cs.DC · March 25, 2026

TL;DR

This paper introduces Trident, a hierarchy-aware 2D distributed SpGEMM algorithm that reduces communication and improves performance on hierarchical GPU interconnects by exploiting intra-node bandwidth advantages.

Contribution

The paper presents the Trident partitioning scheme and communication-avoiding techniques tailored to hierarchical GPU interconnects, improving the efficiency of distributed sparse matrix-matrix multiplication.

Findings

01. Up to 2.38x speedup over traditional 2D SpGEMM

02. Internode communication volume reduced by up to 2x

03. Effective acceleration of Markov Clustering tasks

Abstract

The multiplication of two sparse matrices, known as SpGEMM, is a key kernel in scientific computing and large-scale data analytics, underpinning graph algorithms, machine learning, simulations, and computational biology, where sparsity is often highly unstructured. The unstructured sparsity makes achieving high performance challenging because it limits both memory efficiency and scalability. In distributed memory, the cost of exchanging and merging partial products across nodes further constrains performance. These issues are exacerbated on modern heterogeneous supercomputers with deep, hierarchical GPU interconnects. Current SpGEMM implementations overlook the gap between intra-node and inter-node bandwidth, which results in unnecessary data movement and synchronization and leaves the fast intra-node interconnect underused. To address these challenges, we introduce Trident, a…
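For readers unfamiliar with the kernel, the following is a minimal single-process sketch of row-wise SpGEMM, showing how partial products are generated and merged. It is not the Trident algorithm and involves no distribution or GPU code. The dict-of-dicts storage format and the name `spgemm` are assumptions made for illustration only.

```python
from collections import defaultdict


def spgemm(A, B):
    """Multiply two sparse matrices stored as {row: {col: value}} dicts.

    Hypothetical helper for illustration; not taken from the paper.
    """
    C = {}
    for i, row_a in A.items():
        acc = defaultdict(float)  # accumulator for row i of C
        for k, a_ik in row_a.items():
            # Each nonzero A[i][k] scales row k of B into partial products.
            for j, b_kj in B.get(k, {}).items():
                acc[j] += a_ik * b_kj  # merge partial products by column
        if acc:
            C[i] = dict(acc)
    return C


# Small example: A is 2x3, B is 3x2.
A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0}}
C = spgemm(A, B)
print(C)
```

In a distributed setting, the rows of `B` needed by a process and the partial products it accumulates may live on other nodes, which is where the exchange and merge cost described in the abstract arises.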

Taxonomy

Topics: Graph Theory and Algorithms · Parallel Computing and Optimization Techniques · Interconnection Networks and Systems