TL;DR
This paper introduces cuFINUFFT, a load-balanced GPU library for nonuniform FFTs in 2D and 3D. At low accuracy it is typically 4-10x faster than the parallel CPU code FINUFFT, even including host-device transfer, and it is up to 90x faster than two established GPU codes at high accuracy.
Contribution
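The paper presents a general-purpose, load-balanced GPU library for type 1 and type 2 nonuniform FFTs in 2D and 3D, in single or double precision. It delivers high performance at a user-requested accuracy regardless of how the nonuniform points are distributed, and outperforms both CPU and existing GPU codes.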
Findings
Achieves on-GPU throughput of around 10^9 nonuniform points/sec at low accuracy.
Up to 90x faster than two established GPU codes at high accuracy.
Demonstrates 5-12x speedup over CPU in 3D X-ray diffraction reconstruction.
Abstract
Nonuniform fast Fourier transforms dominate the computational cost in many applications including image reconstruction and signal processing. We thus present a general-purpose GPU-based CUDA library for type 1 (nonuniform to uniform) and type 2 (uniform to nonuniform) transforms in dimensions 2 and 3, in single or double precision. It achieves high performance for a given user-requested accuracy, regardless of the distribution of nonuniform points, via cache-aware point reordering, and load-balanced blocked spreading in shared memory. At low accuracies, this gives on-GPU throughputs around 10^9 nonuniform points per second, and (even including host-device transfer) is typically 4-10x faster than the latest parallel CPU code FINUFFT (at 28 threads). It is competitive with two established GPU codes, being up to 90x faster at high accuracy and/or type 1 clustered point…
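To make the two transform types concrete, the following is a minimal NumPy sketch of the direct (slow) sums that a 2D nonuniform FFT approximates. It is not the library's API or algorithm: the function names `nudft2d_type1` and `nudft2d_type2`, the sign of the exponent, and the centered ordering of the Fourier modes are all assumptions chosen for illustration.

```python
import numpy as np


def _modes(n):
    # Centered integer mode indices, e.g. n=4 -> [-2, -1, 0, 1].
    # This ordering is an assumption made for the sketch.
    return np.arange(-(n // 2), (n + 1) // 2)


def nudft2d_type1(x, y, c, n1, n2):
    """Direct type 1 sum (nonuniform points -> uniform modes).

    f[k1, k2] = sum_j c[j] * exp(+1j * (k1 * x[j] + k2 * y[j]))
    The +1j sign is an assumed convention.
    """
    e1 = np.exp(1j * np.outer(_modes(n1), x))  # shape (n1, M)
    e2 = np.exp(1j * np.outer(_modes(n2), y))  # shape (n2, M)
    return (e1 * c) @ e2.T                     # shape (n1, n2)


def nudft2d_type2(x, y, f):
    """Direct type 2 sum (uniform modes -> nonuniform points).

    c[j] = sum_{k1,k2} f[k1, k2] * exp(-1j * (k1 * x[j] + k2 * y[j]))
    With the opposite sign, this is the adjoint of the type 1 sum above.
    """
    n1, n2 = f.shape
    e1 = np.exp(-1j * np.outer(x, _modes(n1)))  # shape (M, n1)
    e2 = np.exp(-1j * np.outer(y, _modes(n2)))  # shape (M, n2)
    return np.einsum("jk,kl,jl->j", e1, f, e2)  # shape (M,)


# Small usage example with random nonuniform points in [-pi, pi)^2.
rng = np.random.default_rng(0)
M, n1, n2 = 50, 6, 5
x = rng.uniform(-np.pi, np.pi, M)
y = rng.uniform(-np.pi, np.pi, M)
c = rng.standard_normal(M) + 1j * rng.standard_normal(M)
f = nudft2d_type1(x, y, c, n1, n2)
c_back = nudft2d_type2(x, y, f)
```

Each direct sum costs on the order of M * n1 * n2 operations, which is the cost a fast transform is designed to avoid. These functions are mainly useful as a reference for checking the output of a fast implementation at small sizes.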
