GPU-Accelerated Cholesky Factorization of Block Tridiagonal Matrices
Roland Schwan, Daniel Kuhn, Colin N. Jones

TL;DR
This paper introduces a GPU-accelerated framework for efficiently solving block tridiagonal linear systems using a novel permutation strategy and a parallel implementation, significantly outperforming existing solvers, especially for long-horizon problems.
Contribution
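The paper presents a new GPU-based algorithm for block tridiagonal systems that reduces the factorization's complexity from linear to logarithmic in the number of blocks, given sufficient parallel resources, and achieves large speedups over existing solvers, enabling real-time applications.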
Findings
Speedups exceeding 100x over QDLDL
25x faster than optimized CPU implementation
Over 2x faster than NVIDIA CUDSS
Abstract
This paper presents a GPU-accelerated framework for solving block tridiagonal linear systems that arise naturally in numerous real-time applications across engineering and scientific computing. Through a multi-stage permutation strategy based on nested dissection, we reduce the computational complexity from $\mathcal{O}(N n^3)$ for sequential Cholesky factorization to $\mathcal{O}(\log_2(N)\, n^3)$ when sufficient parallel resources are available, where $n$ is the block size and $N$ is the number of blocks. The algorithm is implemented using NVIDIA's Warp library and CUDA to exploit parallelism at multiple levels within the factorization algorithm. Our implementation achieves speedups exceeding 100x compared to the sparse solver QDLDL, 25x compared to a highly optimized CPU implementation using BLASFEO, and more than 2x compared to NVIDIA's CUDSS library. The logarithmic scaling with…
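To make the complexity claim in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes NumPy, and the function names `block_tridiag_cholesky` and `nested_dissection_depths` are hypothetical. The first function is the textbook sequential block Cholesky recursion, whose N dependent steps each cost on the order of n^3 operations. The second applies generic recursive bisection (plain nested dissection on a chain, which may differ from the paper's multi-stage permutation) and reports the elimination-tree depth of each block. Blocks at the same depth do not depend on each other, so the number of distinct depths, roughly log2(N), bounds the number of dependent stages.

```python
import numpy as np


def block_tridiag_cholesky(D, B):
    """Sequential Cholesky of an SPD block tridiagonal matrix A = L L^T.

    D: list of N diagonal blocks (n x n).
    B: list of N-1 sub-diagonal blocks (n x n); B[i] is block (i+1, i) of A.
    Returns (L_diag, L_sub), the diagonal and sub-diagonal blocks of L.
    """
    N = len(D)
    L_diag, L_sub = [], []
    S = D[0]  # current Schur complement of the diagonal block
    for i in range(N):  # N dependent steps: step i needs the result of step i-1
        Lii = np.linalg.cholesky(S)  # O(n^3) dense factorization
        L_diag.append(Lii)
        if i + 1 < N:
            # L_{i+1,i} = B_i L_ii^{-T}. A general solver is used for brevity;
            # a triangular solve would exploit the structure of L_ii.
            Lsub = np.linalg.solve(Lii, B[i].T).T
            L_sub.append(Lsub)
            S = D[i + 1] - Lsub @ Lsub.T  # Schur complement update, O(n^3)
    return L_diag, L_sub


def nested_dissection_depths(N):
    """Elimination-tree depth of each block under recursive bisection.

    The middle block of every interval acts as the separator and is
    eliminated after both halves. Blocks sharing a depth are never
    adjacent in the chain, so they can be processed concurrently.
    """
    depth = [0] * N
    stack = [(0, N, 0)]  # half-open interval [lo, hi) and its recursion depth
    while stack:
        lo, hi, d = stack.pop()
        if lo >= hi:
            continue
        mid = (lo + hi) // 2
        depth[mid] = d
        stack.append((lo, mid, d + 1))
        stack.append((mid + 1, hi, d + 1))
    return depth


depths = nested_dissection_depths(7)   # [2, 1, 2, 0, 2, 1, 2]
num_stages = max(depths) + 1           # 3 dependent stages instead of 7 steps
```

In this sketch the sequential loop has N dependent iterations, while the bisection ordering for N = 7 needs only 3 stages, with the deepest blocks (depth 2) eliminated first and the root separator (depth 0) last. Fill-in handling and the mapping of each stage onto GPU kernels are deliberately left out.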
Taxonomy
Topics: Matrix Theory and Algorithms · Parallel Computing and Optimization Techniques · Interconnection Networks and Systems
