Beating vDSP: A 138 GFLOPS Radix-8 Stockham FFT on Apple Silicon via Two-Tier Register-Threadgroup Memory Decomposition
Mohamed Amine Bergach

TL;DR
This paper presents a radix-8 Stockham FFT for Apple Silicon GPUs that reaches 138.45 GFLOPS on a 4096-point transform, 29% above Apple's vDSP, by keeping data resident in the register file and using threadgroup memory only for exchange.
Contribution
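It presents a two-tier local memory model, with the 208 KiB register file as the data-resident tier and the 32 KiB threadgroup memory as an exchange-only tier, together with radix-4 and radix-8 Stockham FFT kernels written in Metal Shading Language on top of that model.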
Findings
Achieved 138.45 GFLOPS for 4096-point FFT, 29% faster than vDSP
Radix-8 butterfly with 512 threads yields best performance
Threadgroup memory barriers are inexpensive, but scattered access patterns are bottlenecks
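The headline numbers can be cross-checked with a few lines of arithmetic. The sketch below assumes the common 5 N log2 N operation-count convention for complex FFTs, which this summary does not state explicitly, so the per-transform time it derives is an estimate under that assumption. The function names are hypothetical.

```python
import math

def fft_flops(n):
    # Assumed convention: 5 * N * log2(N) floating-point operations
    # for one complex N-point FFT (not confirmed by the summary).
    return 5.0 * n * math.log2(n)

def relative_gain(new_gflops, base_gflops):
    # Fractional improvement of new over base, e.g. 0.29 means 29%.
    return new_gflops / base_gflops - 1.0

# Figures quoted in the findings and abstract.
GPU_GFLOPS = 138.45
VDSP_GFLOPS = 107.0

gain = relative_gain(GPU_GFLOPS, VDSP_GFLOPS)
flops_4096 = fft_flops(4096)
# Implied time for one 4096-point transform under the assumed convention.
seconds_per_fft = flops_4096 / (GPU_GFLOPS * 1e9)

print(f"gain over vDSP: {gain:.1%}")
print(f"flops per 4096-point FFT: {flops_4096:.0f}")
print(f"implied time per FFT: {seconds_per_fft * 1e6:.2f} microseconds")
```

The ratio of the two quoted throughputs agrees with the stated 29% improvement. The time per transform depends entirely on the assumed operation count.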
Abstract
We present an optimized Fast Fourier Transform (FFT) implementation for Apple Silicon GPUs, achieving 138.45 GFLOPS for complex single-precision transforms, a 29% improvement over Apple's highly optimized vDSP/Accelerate baseline (107 GFLOPS). Our approach is grounded in a two-tier local memory model that formally characterizes the Apple GPU's 208 KiB register file as the primary data-resident tier and the 32 KiB threadgroup memory as an exchange-only tier, extending the decomposition framework established in a 2015 PhD thesis on Intel integrated GPU FFT for radar processing. We implement and evaluate radix-4 and radix-8 split-radix Stockham kernels in Metal Shading Language (MSL), demonstrating that the radix-8 decimation-in-time butterfly with 512 threads yields the best performance. We further present the first investigation of Apple's…
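For readers unfamiliar with the Stockham formulation, the following is a minimal sketch of a radix-2 Stockham autosort FFT in Python. It is not the paper's radix-8 MSL kernel: the radix, the language, and the name `stockham_fft_radix2` are all assumptions chosen for brevity. It only illustrates the defining property of the scheme, namely that each stage reads from one buffer and writes to another in already-sorted order, so no bit-reversal pass is needed.

```python
import cmath

def stockham_fft_radix2(data):
    # Illustrative radix-2 Stockham autosort FFT (hypothetical helper,
    # not the paper's radix-8 kernel). Length must be a power of two.
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    src = [complex(v) for v in data]   # buffer read by the current stage
    dst = [0j] * n                     # buffer written by the current stage
    m = n        # size of the sub-transforms in the current stage
    stride = 1   # number of interleaved sub-transforms
    while m > 1:
        half = m // 2
        for p in range(half):
            # Twiddle factor for this butterfly position.
            w = cmath.exp(-2j * cmath.pi * p / m)
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                # Outputs land in sorted order: this is the "autosort".
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * w
        src, dst = dst, src   # ping-pong the two buffers
        m = half
        stride *= 2
    return src

if __name__ == "__main__":
    print(stockham_fft_radix2([1, 0, 0, 0]))
```

A higher radix, such as the radix-8 butterfly the abstract describes, performs more work per stage on values held by each thread and therefore needs fewer exchange stages. This sketch does not model registers or threadgroup memory.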
