Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures

Chenhao Xie; Jieyang Chen; Jesun S Firoz; Jiajia Li; Shuaiwen Leon Song; Kevin Barker; Mark Raugas; Ang Li

arXiv:2012.06959 · cs.DC · December 15, 2020

Open Access

TL;DR

This paper presents a novel multi-GPU sparse triangular solver that leverages NVSHMEM and a malleable task-pool model to significantly improve performance and scalability over unified memory approaches.

Contribution

The work introduces a scalable multi-GPU SpTRSV design using NVSHMEM and a malleable task-pool model, addressing irregular memory references and workload imbalance.

Findings

01 Achieves up to 9.86x speedup on DGX-1 systems.

02 Demonstrates effective utilization of multi-GPU resources.

03 Outperforms unified memory-based approaches significantly.

Abstract

Designing efficient and scalable sparse linear algebra kernels on modern multi-GPU based HPC systems is a daunting task due to significant irregular memory references and workload imbalance across the GPUs. This is particularly the case for the Sparse Triangular Solver (SpTRSV), which introduces additional two-dimensional computation dependencies among subsequent computation steps. Dependency information is exchanged and shared among GPUs, which calls for efficient memory allocation, data partitioning, and workload distribution, as well as fine-grained communication and synchronization support. In this work, we demonstrate that directly adopting unified memory can adversely affect the performance of SpTRSV on multi-GPU architectures, even when the GPUs are linked via fast interconnects such as NVLink and NVSwitch. Alternatively, we employ the latest NVSHMEM technology based on Partitioned Global Address…
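To make the dependency structure mentioned in the abstract concrete, here is a minimal serial sketch of a sparse lower-triangular solve in CSR format, together with a level-set analysis showing which rows could be solved concurrently. It is an illustration only, not the paper's multi-GPU NVSHMEM design; the function names (`sptrsv_lower_csr`, `dependency_levels`) and the 4x4 example matrix are hypothetical, and the sketch assumes every row stores a nonzero diagonal entry.

```python
def sptrsv_lower_csr(indptr, indices, data, b):
    """Solve L x = b by forward substitution.

    L is lower triangular in CSR form (indptr, indices, data).
    Assumption: each row stores its nonzero diagonal entry.
    """
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc = b[i]
        diag = None
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j == i:
                diag = data[k]
            else:
                # Row i depends on x[j], which must already be solved (j < i).
                acc -= data[k] * x[j]
        x[i] = acc / diag
    return x


def dependency_levels(indptr, indices):
    """Return the level of each row: rows in the same level do not depend
    on each other and could be solved in parallel."""
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:
                level[i] = max(level[i], level[j] + 1)
    return level


# Hypothetical example matrix:
# L = [[2, 0, 0, 0],
#      [1, 1, 0, 0],
#      [0, 0, 4, 0],
#      [0, 3, 1, 2]]
indptr = [0, 1, 3, 4, 7]
indices = [0, 0, 1, 2, 1, 2, 3]
data = [2, 1, 1, 4, 3, 1, 2]
b = [2, 3, 8, 12]

x = sptrsv_lower_csr(indptr, indices, data, b)
levels = dependency_levels(indptr, indices)
print(x)       # [1.0, 2.0, 2.0, 2.0]
print(levels)  # [0, 1, 0, 2]
```

In this example, rows 0 and 2 share level 0 and can be solved independently, while row 3 must wait for rows 1 and 2. When rows are spread across several GPUs, each such wait becomes a cross-device exchange of solved values, which is the kind of fine-grained communication the abstract refers to.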

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Interconnection Networks and Systems · Distributed and Parallel Computing Systems