A Distributed-memory Tridiagonal Solver Based on a Specialised Data Structure Optimised for CPU and GPU Architectures

Semih Akkurt; Sébastien Lemaire; Paul Bartholomew; Sylvain Laizet

arXiv:2411.13532·cs.DC·July 22, 2025

Open Access

TL;DR

This paper introduces DistD2-TDS, a distributed-memory tridiagonal solver leveraging a specialised data structure to optimise communication, data locality, and vectorisation, enabling efficient large-scale PDE solutions on CPU and GPU supercomputers.

Contribution

The paper presents a novel distributed-memory tridiagonal solver algorithm that reduces communication and improves performance through a specialised data structure optimised for CPU and GPU architectures.

Findings

01. Achieves 66% of theoretical peak bandwidth at scale
02. Demonstrates strong scaling up to 384 NVIDIA H100 GPUs and 8192 AMD EPYC CPUs
03. Effectively solves 3D non-linear PDEs using finite difference schemes

Abstract

Various numerical methods used for solving partial differential equations (PDEs) result in tridiagonal systems. Solving tridiagonal systems in distributed-memory environments is not straightforward, and often requires a significant amount of communication. In this article, we present a novel distributed-memory tridiagonal solver algorithm, DistD2-TDS, based on a specialised data structure. The DistD2-TDS algorithm takes advantage of the diagonal dominance in tridiagonal systems to reduce communication in distributed-memory environments. The underlying data structure plays a crucial role in the performance of the algorithm. First, the data structure improves data locality and makes it possible to minimise data movement via cache blocking and kernel fusion strategies. Second, data continuity enables a contiguous data access pattern and results in efficient utilisation of the available…
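
For orientation, the kind of system the abstract refers to can be illustrated with a minimal serial sketch. This is not the DistD2-TDS algorithm and involves no distributed-memory communication; it is the classical Thomas algorithm applied to a batch of independent tridiagonal systems that share the same coefficients. The function name `thomas_batched` and the layout `d[i][k]` (entry `i` of system `k`, with the batch index innermost) are assumptions made for this illustration only; the batch-innermost layout is meant to hint at the contiguous access pattern mentioned above, not to reproduce the paper's data structure.

```python
def thomas_batched(a, b, c, d):
    """Solve many independent tridiagonal systems with the Thomas algorithm.

    a, b, c: sub-, main- and super-diagonal coefficients, each of length n,
             shared by all systems (a[0] and c[n-1] are not used).
    d:       list of n rows; d[i][k] is right-hand-side entry i of system k.
    Returns a new list of n rows holding the solutions in the same layout.
    Assumes the matrix is diagonally dominant, so no pivoting is needed.
    """
    n = len(b)
    m = len(d[0])
    cp = [0.0] * n                 # modified super-diagonal
    x = [row[:] for row in d]      # work on a copy of the right-hand sides

    # Forward elimination: the inner loop runs over the batch index k,
    # so consecutive iterations touch neighbouring entries of one row.
    cp[0] = c[0] / b[0]
    for k in range(m):
        x[0][k] /= b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        for k in range(m):
            x[i][k] = (x[i][k] - a[i] * x[i - 1][k]) / denom

    # Back substitution, again sweeping the batch in the inner loop.
    for i in range(n - 2, -1, -1):
        for k in range(m):
            x[i][k] -= cp[i] * x[i + 1][k]
    return x
```

Calling `thomas_batched(a, b, c, d)` with, for example, `b = [4.0] * n` and off-diagonals of `1.0` gives a diagonally dominant system for which the elimination is stable without pivoting. Diagonal dominance is the same property the abstract says DistD2-TDS exploits to reduce communication.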

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Graph Theory and Algorithms · Matrix Theory and Algorithms