A Distributed-memory Tridiagonal Solver Based on a Specialised Data Structure Optimised for CPU and GPU Architectures
Semih Akkurt, Sébastien Lemaire, Paul Bartholomew, Sylvain Laizet

TL;DR
This paper introduces DistD2-TDS, a distributed-memory tridiagonal solver leveraging a specialised data structure to optimise communication, data locality, and vectorisation, enabling efficient large-scale PDE solutions on CPU and GPU supercomputers.
Contribution
The paper presents a novel distributed-memory tridiagonal solver algorithm that reduces communication and enhances performance through a specialised data structure optimised for CPU and GPU architectures.
Findings
Achieves 66% of theoretical peak bandwidth at scale
Demonstrates strong scaling up to 384 NVIDIA H100 GPUs and 8192 AMD EPYC CPUs
Effectively solves 3D non-linear PDEs using finite difference schemes
Abstract
Various numerical methods used for solving partial differential equations (PDEs) result in tridiagonal systems. Solving tridiagonal systems in distributed-memory environments is not straightforward, and often requires a significant amount of communication. In this article, we present a novel distributed-memory tridiagonal solver algorithm, DistD2-TDS, based on a specialised data structure. The DistD2-TDS algorithm takes advantage of the diagonal dominance in tridiagonal systems to reduce communication in distributed-memory environments. The underlying data structure plays a crucial role in the performance of the algorithm. First, the data structure improves data locality and makes it possible to minimise data movement via cache blocking and kernel fusion strategies. Second, data continuity enables a contiguous data access pattern and results in efficient utilisation of the available…
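For readers unfamiliar with tridiagonal systems, the sketch below shows the classic serial Thomas algorithm, the single-process baseline that distributed solvers build on. It is not the DistD2-TDS algorithm and does not reproduce the paper's data structure; the function name thomas_solve and the array layout (a, b, c for the sub-, main and super-diagonal, d for the right-hand side) are assumptions made for illustration. Like the systems discussed in the abstract, it assumes diagonal dominance, so no pivoting is done.

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system with the serial Thomas algorithm.

    a: sub-diagonal (a[0] is unused)
    b: main diagonal
    c: super-diagonal (c[-1] is unused)
    d: right-hand side
    Assumes a diagonally dominant system, so no pivoting is performed.
    """
    n = len(b)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side

    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]

    # Forward elimination: remove the sub-diagonal row by row.
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom

    # Back substitution: recover the unknowns from last to first.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


# Example: a 4x4 system with 4 on the diagonal and 1 on the off-diagonals.
a = [0.0, 1.0, 1.0, 1.0]
b = [4.0, 4.0, 4.0, 4.0]
c = [1.0, 1.0, 1.0, 0.0]
d = [6.0, 12.0, 18.0, 19.0]
x = thomas_solve(a, b, c, d)
```

Both loops carry a dependency from one row to the next, which is why splitting a single system across processes is not straightforward and normally requires communication between neighbouring subdomains.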
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Graph Theory and Algorithms · Matrix Theory and Algorithms
