Scalable Multi-node Fast Fourier Transform on GPUs
Manthan Verma, Soumyadeep Chatterjee, Gaurav Garg, Bharatkumar Sharma, Nishant Arya, Shashi Kumar, Anish Saxena, Mahendra K. Verma

TL;DR
This paper introduces a multi-node GPU-FFT library for high-performance computing and reports good scaling on up to 512 A100 GPUs of the Selene HPC system.
Contribution
The paper presents a multi-node GPU-FFT library that uses slab decomposition for data division and MPI for communication among GPUs, together with scaling benchmarks on a large GPU cluster.
Findings
Good scaling observed for the 4096^3 grid on 64 to 512 GPUs
GPU-FFT timing on 128 GPUs is comparable to multicore CPU FFT on 196608 cores of a Cray XC40
Efficient communication via NVLink contributes to GPU-FFT performance
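As a rough illustration of what slab decomposition implies for the 4096^3 runs, the sketch below computes how many planes and how much data each GPU would hold. The function name `slab_layout` is hypothetical, and the 16 bytes per grid point assume double-precision complex values; the summary does not state the precision used, and work buffers are ignored.

```python
def slab_layout(n, num_gpus, bytes_per_point=16):
    """Return (planes per GPU, bytes per GPU) for an n^3 grid split into slabs.

    bytes_per_point=16 assumes double-precision complex data (an assumption,
    not something stated in the paper summary).
    """
    if n % num_gpus != 0:
        raise ValueError("this sketch assumes n is divisible by num_gpus")
    planes = n // num_gpus              # each GPU owns `planes` full n x n planes
    return planes, planes * n * n * bytes_per_point


if __name__ == "__main__":
    for gpus in (64, 128, 256, 512):
        planes, nbytes = slab_layout(4096, gpus)
        print(f"{gpus} GPUs: {planes} planes each, {nbytes / 2**30:.0f} GiB per GPU")
```

Under these assumptions the per-GPU slab shrinks in proportion to the GPU count, which is the behaviour a strong-scaling test exercises.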
Abstract
In this paper, we present the details of our multi-node GPU-FFT library, as well as its scaling on the Selene HPC system. Our library employs slab decomposition for data division and MPI for communication among GPUs. We performed GPU-FFT on three grid sizes using a maximum of 512 A100 GPUs. We observed good scaling for the 4096^3 grid with 64 to 512 GPUs. We report that the timing of a multicore FFT on 196608 cores of a Cray XC40 is comparable to that of a GPU-FFT on 128 GPUs. The efficiency of GPU-FFT is due to the fast computation capabilities of the A100 card and efficient communication via NVLink.
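The slab-decomposition scheme described in the abstract can be sketched on a single process as follows. This is an illustrative toy, not the library's code: a naive `dft` stands in for the GPU FFT kernel, list indexing stands in for the MPI exchange (assumed here to be an all-to-all transpose, which the summary does not spell out), and the names `fft_along`, `slab_fft`, and `gather` are hypothetical.

```python
import cmath
import random


def dft(x):
    """Naive O(n^2) 1D DFT, standing in for a GPU FFT kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def fft_along(a, axis):
    """Transform the nested list a[i][j][k] in place along one axis."""
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])
    if axis == 2:
        for i in range(n0):
            for j in range(n1):
                a[i][j] = dft(a[i][j])
    elif axis == 1:
        for i in range(n0):
            for k in range(n2):
                col = dft([a[i][j][k] for j in range(n1)])
                for j in range(n1):
                    a[i][j][k] = col[j]
    else:
        for j in range(n1):
            for k in range(n2):
                col = dft([a[i][j][k] for i in range(n0)])
                for i in range(n0):
                    a[i][j][k] = col[i]


def slab_fft(data, p):
    """3D DFT of data[x][y][z] (n^3 grid) using p simulated ranks."""
    n = len(data)
    s = n // p  # planes per rank; assumes p divides n
    # 1. Rank r owns the x-planes [r*s, (r+1)*s): a slab of shape (s, n, n).
    slabs = [[[row[:] for row in data[x]] for x in range(r * s, (r + 1) * s)]
             for r in range(p)]
    # 2. Each rank transforms its slab along z and y; no communication needed.
    for slab in slabs:
        fft_along(slab, 2)
        fft_along(slab, 1)
    # 3. Exchange (stand-in for MPI): rank q gathers ky in [q*s, (q+1)*s)
    #    for every x, giving a slab of shape (n, s, n).
    recv = [[[slabs[x // s][x % s][y][:] for y in range(q * s, (q + 1) * s)]
             for x in range(n)] for q in range(p)]
    # 4. Each rank now holds complete x-lines and transforms along x.
    for slab in recv:
        fft_along(slab, 0)
    return recv


def gather(recv, n, p):
    """Reassemble the per-rank ky-slabs into a full out[kx][ky][kz] array."""
    s = n // p
    return [[recv[ky // s][kx][ky % s] for ky in range(n)] for kx in range(n)]


if __name__ == "__main__":
    random.seed(0)
    n = 4
    data = [[[complex(random.random(), random.random()) for _ in range(n)]
             for _ in range(n)] for _ in range(n)]
    split = gather(slab_fft(data, 2), n, 2)
    whole = gather(slab_fft(data, 1), n, 1)  # p = 1 means no decomposition
    diff = max(abs(split[i][j][k] - whole[i][j][k])
               for i in range(n) for j in range(n) for k in range(n))
    print("max difference between 2-rank and 1-rank result:", diff)
```

Step 3 is the only point where data crosses rank boundaries, which is why the interconnect matters so much for this layout.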
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Advanced Data Compression Techniques
