A Performance-Portable, Massively Parallel Distributed Nonuniform FFT
Paul Fischill, Andreas Adelmann, Sriramkrishnan Muralikrishnan

TL;DR
This paper introduces the first distributed, performance-portable NUFFT implementation that scales across heterogeneous supercomputers, enabling large-scale spectral simulations with irregular data.
Contribution
The paper presents a Kokkos-based distributed NUFFT that runs on both NVIDIA and AMD GPUs, with spreading and interpolation kernels optimized for different accuracy requirements and architectures, and demonstrates its scalability on multiple supercomputers.
Findings
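The spreading kernels match or exceed the single-GPU throughput of the CUDA-based cuFINUFFT library at production particle densities.
The same Kokkos-based code runs without modification on AMD GPUs, which cuFINUFFT does not target.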
Enables large-scale kinetic plasma simulations with billions of particles.
Abstract
The nonuniform fast Fourier transform (NUFFT) enables spectral methods for problems with irregularly spaced samples, with applications in medical imaging, molecular dynamics, and kinetic plasma simulations. Existing implementations are limited to shared-memory execution, restricting problem sizes to what fits on a single node. We present the first distributed, performance-portable NUFFT for heterogeneous supercomputers. Our Kokkos-based implementation runs without modification on NVIDIA and AMD GPUs. We develop multiple spreading and interpolation kernels optimized for different accuracy requirements and architectures. Our spreading kernels match or exceed the single-GPU throughput of the state-of-the-art CUDA-based NUFFT library cuFINUFFT at production particle densities, while our Kokkos-based implementation additionally supports AMD GPUs. Strong scaling experiments on Alps (NVIDIA…
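To make the role of the spreading step concrete, here is a minimal single-process sketch of a 1D type-1 NUFFT in pure Python. It is not the paper's implementation: it assumes the classical Gaussian-gridding formulation (oversampling factor `R`, spreading half-width `Msp`), uses a naive DFT in place of a fast or distributed FFT to stay dependency-free, and all function and parameter names are hypothetical. The paper's own kernels, window function, and data layout may differ.

```python
import cmath
import math


def nudft_type1(x, c, M):
    """Direct O(N*M) reference: f_k = sum_j c_j * exp(-i*k*x_j), k = -M/2 .. M/2-1."""
    return [
        sum(cj * cmath.exp(-1j * k * xj) for xj, cj in zip(x, c))
        for k in range(-(M // 2), M // 2)
    ]


def nufft_type1_gaussian(x, c, M, R=2, Msp=12):
    """Sketch of a type-1 NUFFT via Gaussian gridding (illustrative assumption,
    not the paper's kernels). Points x_j lie in [0, 2*pi); M is even."""
    Mr = R * M                                   # oversampled grid size
    h = 2.0 * math.pi / Mr                       # grid spacing
    tau = math.pi * Msp / (M * M * R * (R - 0.5))  # Gaussian width parameter

    # Step 1 (spreading): add each nonuniform sample onto nearby grid points,
    # weighted by a truncated, periodically wrapped Gaussian.
    grid = [0j] * Mr
    for xj, cj in zip(x, c):
        m0 = math.floor(xj / h)
        for m in range(m0 - Msp, m0 + Msp + 1):
            d = xj - m * h
            grid[m % Mr] += cj * math.exp(-d * d / (4.0 * tau))

    # Step 2 (uniform transform): naive DFT stands in for the FFT here.
    # Step 3 (deconvolution): divide out the Gaussian's Fourier transform.
    out = []
    for k in range(-(M // 2), M // 2):
        Fk = sum(grid[m] * cmath.exp(-1j * k * m * h) for m in range(Mr)) / Mr
        out.append(math.sqrt(math.pi / tau) * math.exp(k * k * tau) * Fk)
    return out


# Small usage example with arbitrary sample points and strengths.
x = [0.1, 1.7, 2.9, 4.4, 6.0]
c = [1 + 0.5j, -0.3j, 0.7, -1.2 + 0.2j, 0.4]
M = 16
exact = nudft_type1(x, c, M)
approx = nufft_type1_gaussian(x, c, M)
max_err = max(abs(a - b) for a, b in zip(exact, approx))
```

In this sketch the spreading loop is the part whose cost grows with the number of particles, which is why it is the natural target for architecture-specific kernels. In a distributed setting, both the particles and the uniform grid would additionally have to be partitioned across ranks, which this example does not attempt.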
