Beating vDSP: A 138 GFLOPS Radix-8 Stockham FFT on Apple Silicon via Two-Tier Register-Threadgroup Memory Decomposition

Mohamed Amine Bergach

arXiv:2603.27569 · cs.DC · March 31, 2026

TL;DR

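This paper presents an FFT implementation for Apple Silicon GPUs that reaches 138.45 GFLOPS on 4096-point complex single-precision transforms, 29% above Apple's vDSP/Accelerate baseline (107 GFLOPS). It does so by treating the GPU register file as the data-resident tier and threadgroup memory as an exchange-only tier.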

Contribution

It formalizes a two-tier local memory model (the 208 KiB register file as the data-resident tier, the 32 KiB threadgroup memory as an exchange-only tier) and uses it to implement radix-4 and radix-8 Stockham FFT kernels in Metal Shading Language.

Findings

01  Achieves 138.45 GFLOPS for a 4096-point complex single-precision FFT, 29% above the vDSP baseline (107 GFLOPS).

02  The radix-8 butterfly with 512 threads yields the best performance.

03  Threadgroup memory barriers are inexpensive; scattered access patterns are the bottleneck.
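The figures quoted on this page can be cross-checked with a little arithmetic. The sketch below is illustrative only and rests on assumptions that the page does not confirm: that each of the 512 threads holds one radix-8 butterfly's worth of points per pass, and that GFLOPS is counted with the conventional 5·N·log2(N) flop estimate for a complex FFT. All variable names are hypothetical.

```python
import math

N = 4096                  # transform size quoted in the findings
THREADS = 512             # threadgroup size reported as best
RADIX = 8                 # butterfly radix reported as best
BYTES_PER_COMPLEX = 8     # single-precision complex = two 4-byte floats

# Assumption: every thread owns one radix-8 butterfly per pass.
points_per_thread = N // THREADS

# Number of radix-8 passes needed to cover N points (integer arithmetic).
passes = 0
remaining = N
while remaining > 1:
    remaining //= RADIX
    passes += 1

# Size of one transform's data, for comparison with the 32 KiB
# threadgroup memory figure quoted in the abstract.
data_kib = N * BYTES_PER_COMPLEX / 1024

# Ratio of the two rounded throughput figures quoted on this page.
speedup = 138.45 / 107.0

# Assumption: conventional 5*N*log2(N) flop count for a complex FFT.
flops_per_fft = 5 * N * math.log2(N)

# Implied time per transform if 138.45 GFLOPS were sustained.
time_us = flops_per_fft / 138.45e9 * 1e6

print(points_per_thread, passes, data_kib)
print(round(speedup, 3), int(flops_per_fft), round(time_us, 3))
```

Under these assumptions, 512 threads with 8 points each cover all 4096 points, four radix-8 passes complete the transform, and one transform's data occupies 32 KiB. The ratio of the rounded throughput figures comes out near 1.29, consistent with the reported 29%.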

Abstract

We present an optimized Fast Fourier Transform (FFT) implementation for Apple Silicon GPUs, achieving 138.45 GFLOPS for N = 4096 complex single-precision transforms, a 29% improvement over Apple's highly optimized vDSP/Accelerate baseline (107 GFLOPS). Our approach is grounded in a two-tier local memory model that formally characterizes the Apple GPU's 208 KiB register file as the primary data-resident tier and the 32 KiB threadgroup memory as an exchange-only tier, extending the decomposition framework established in a 2015 PhD thesis on Intel integrated GPU FFT for radar processing. We implement and evaluate radix-4 and radix-8 split-radix Stockham kernels in Metal Shading Language (MSL), demonstrating that the radix-8 decimation-in-time butterfly with 512 threads yields the best performance. We further present the first investigation of Apple's…
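For readers unfamiliar with the Stockham formulation, the following is a minimal CPU sketch of the autosort idea. It is not the paper's kernel: it uses radix-2 in decimation-in-frequency form rather than the radix-8 decimation-in-time butterfly described above, and it runs in plain Python rather than MSL. The function names `stockham_fft` and `naive_dft` are hypothetical. What it does show is the defining property of Stockham FFTs: each pass reads from one buffer and writes to another in an order that leaves the final output naturally ordered, so no bit-reversal step is needed.

```python
import cmath


def stockham_fft(x):
    """Radix-2 Stockham autosort FFT (illustrative, power-of-two sizes only)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    src = list(x)        # buffer read by the current pass
    dst = [0j] * n       # buffer written by the current pass
    length = n           # size of each sub-transform in this pass
    stride = 1           # number of interleaved sub-transforms
    while length > 1:
        half = length // 2
        for p in range(half):
            w = cmath.exp(-2j * cmath.pi * p / length)  # twiddle factor
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * w
        src, dst = dst, src  # ping-pong: this pass's output feeds the next
        length = half
        stride *= 2
    return src


def naive_dft(x):
    """O(n^2) reference DFT used only to check the sketch."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


if __name__ == "__main__":
    data = [complex(i, -0.5 * i) for i in range(16)]
    err = max(abs(g - r) for g, r in zip(stockham_fft(data), naive_dft(data)))
    print("max abs error vs naive DFT:", err)
```

The `src`/`dst` swap plays the role of the inter-pass data exchange. How the paper divides that exchange between registers and threadgroup memory is not shown in this excerpt, so the sketch makes no attempt to model it.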
