Wafer-Scale Fast Fourier Transforms

Marcelo Orenes-Vera; Ilya Sharapov; Robert Schreiber; Mathias Jacquelin; Philippe Vandermersch; Sharan Chetlur

arXiv:2209.15040·cs.DC·June 26, 2023

Open Access

TL;DR

This paper presents a highly parallelized wafer-scale implementation of 1D, 2D, and 3D FFTs on the Cerebras CS-2, achieving unprecedented performance and scaling for large 3D FFTs by leveraging the wafer-scale architecture.

Contribution

The paper introduces the first wafer-scale FFT implementation that exploits the Cerebras CS-2's architecture for efficient parallelization and communication, breaking the millisecond barrier for the 3D FFT of a 512^3 complex array.

Findings

01. Computes the 3D FFT of a 512^3 complex array in 959 microseconds.

02. First implementation to break the millisecond barrier for this problem size.

03. Demonstrates near-peak bandwidth utilization on the wafer-scale mesh.

Abstract

We have implemented fast Fourier transforms for one, two, and three-dimensional arrays on the Cerebras CS-2, a system whose memory and processing elements reside on a single silicon wafer. The wafer-scale engine (WSE) encompasses a two-dimensional mesh of roughly 850,000 processing elements (PEs) with fast local memory and equally fast nearest-neighbor interconnections. Our wafer-scale FFT (wsFFT) parallelizes an $n^{3}$ problem with up to $n^{2}$ PEs. At that limit, each PE processes only a single vector of the 3D domain (known as a pencil) per superstep, where each of the three supersteps performs FFTs along one of the three axes of the input array. Between supersteps, wsFFT redistributes (transposes) the data to bring all elements of each one-dimensional pencil being transformed into the memory of a single PE. Each redistribution causes an all-to-all communication along one of the mesh…
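The superstep structure described in the abstract can be illustrated with a small serial sketch. This is not the authors' wsFFT code and does not model the WSE mesh or its communication: each `(i, j)` index pair merely stands in for one PE holding one pencil, and a cyclic rotation of the axes stands in for the redistribution between supersteps. The function names `fft1d` and `pencil_fft3d` are hypothetical, and the radix-2 kernel is an assumption chosen for brevity.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def pencil_fft3d(a):
    """3D FFT of an n x n x n nested list a[x][y][z] in three supersteps.

    Assumption: the axis rotation below is only a serial stand-in for the
    all-to-all redistribution; it moves no data between real processors.
    """
    n = len(a)
    for _ in range(3):
        # Superstep: every (i, j) "PE" transforms the single pencil it holds.
        a = [[fft1d(a[i][j]) for j in range(n)] for i in range(n)]
        # Redistribution: rotate the axes so the next axis becomes the
        # pencil (innermost) axis. Three rotations restore a[x][y][z] order.
        a = [[[a[k][i][j] for k in range(n)] for j in range(n)]
             for i in range(n)]
    return a


if __name__ == "__main__":
    n = 4
    # Unit impulse at the origin; its transform is constant across the array.
    data = [[[1 + 0j if (x, y, z) == (0, 0, 0) else 0j for z in range(n)]
             for y in range(n)] for x in range(n)]
    result = pencil_fft3d(data)
    print(result[1][2][3])
```

With $n^{2}$ pencils per superstep, the first list comprehension is where the abstract's parallelism lives: every pencil transform is independent, so up to $n^{2}$ PEs can work at once, and only the rotation step requires communication.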

Taxonomy

Topics: 3D IC and TSV technologies · Interconnection Networks and Systems · Parallel Computing and Optimization Techniques