Mixed-Precision Performance Portability of FFT-Based GPU-Accelerated Algorithms for Block-Triangular Toeplitz Matrices

Sreeram Venkat; Kasia Swirydowicz; Noah Wolfe; Omar Ghattas

arXiv:2508.10202·cs.DC·October 6, 2025

TL;DR

This paper presents an on-the-fly, hipify-based performance portability framework and a dynamic mixed-precision framework for FFTMatvec, an HPC application that computes matrix-vector products with block-triangular Toeplitz matrices, so that the originally CUDA-only application also runs on AMD GPUs.

Contribution

The paper contributes an on-the-fly performance portability framework based on hipify and a dynamic mixed-precision framework for FFTMatvec, in which a Pareto front analysis guides the choice of mixed-precision configuration.

Findings

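01 Ported FFTMatvec, originally a CUDA-only application, to AMD GPUs using an on-the-fly hipify framework.

02 Integrated the AMD GPU performance optimizations into the open-source rocBLAS library, leaving the application code unchanged.

03 Scaled the mixed-precision FFTMatvec to 4,096 GPUs.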

Abstract

The hardware diversity in leadership-class computing facilities, alongside the immense performance boosts from today's GPUs when computing in lower precision, incentivizes scientific HPC workflows to adopt mixed-precision algorithms and performance portability models. We present an on-the-fly framework using hipify for performance portability and apply it to FFTMatvec - an HPC application that computes matrix-vector products with block-triangular Toeplitz matrices. Our approach enables FFTMatvec, initially a CUDA-only application, to run seamlessly on AMD GPUs with excellent performance. Performance optimizations for AMD GPUs are integrated into the open-source rocBLAS library, keeping the application code unchanged. We then present a dynamic mixed-precision framework for FFTMatvec; a Pareto front analysis determines the optimal mixed-precision configuration for a desired error…
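As a rough illustration of why FFTs help with Toeplitz matrix-vector products, the sketch below handles the simplest case: a scalar lower-triangular Toeplitz matrix defined by its first column. It uses the standard zero-padding (circulant embedding) trick, so the product becomes an elementwise multiplication in Fourier space. This is not FFTMatvec's implementation; the function names, the pure-Python radix-2 FFT, and the scalar (block size 1) setting are assumptions made for the example.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def toeplitz_lower_matvec_fft(col, v):
    """Compute y = T v, where T[i][j] = col[i - j] for i >= j and 0 otherwise."""
    n = len(col)
    # Pad to a power of two >= 2n so circular convolution equals linear convolution.
    m = 1
    while m < 2 * n:
        m *= 2
    col_hat = fft(list(col) + [0.0] * (m - n))
    v_hat = fft(list(v) + [0.0] * (m - n))
    # Elementwise product in Fourier space, then inverse transform.
    y = fft([a * b for a, b in zip(col_hat, v_hat)], inverse=True)
    # The inverse above is unnormalized, so divide by m; keep the first n entries.
    return [val.real / m for val in y[:n]]


def toeplitz_lower_matvec_dense(col, v):
    """Reference O(n^2) product used only to check the FFT version."""
    n = len(col)
    return [sum(col[i - j] * v[j] for j in range(i + 1)) for i in range(n)]


if __name__ == "__main__":
    col = [1.0, 2.0, 3.0, 4.0]
    v = [1.0, 0.0, -1.0, 2.0]
    print(toeplitz_lower_matvec_fft(col, v))
    print(toeplitz_lower_matvec_dense(col, v))
```

The FFT route costs O(n log n) instead of O(n^2) for the dense product. In a block-triangular setting, the same idea would apply along the Toeplitz dimension, with the elementwise product replaced by small dense matrix products per frequency; that extension is not shown here.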
