Shortest-Path FFT: Optimal SIMD Instruction Scheduling via Graph Search
Mohamed Amine Bergach

TL;DR
This paper models FFT instruction scheduling as a shortest-path problem on a directed acyclic graph. A context-aware variant of the graph lets each edge cost depend on the preceding operation, which captures cache effects and yields a 34% faster FFT than the context-free model.
Contribution
It formalizes context-aware graph models for FFT instruction scheduling, capturing cache effects, and demonstrates their effectiveness on Apple M1 NEON with substantial speedups.
Findings
The context-aware model yields an FFT that is 34% faster than the one found by the context-free model.
The optimal schedule includes non-traditional radix-2 passes that exploit cache residuals.
The graph-search approach finds arrangements that traditional methods miss.
Abstract
An N-point FFT admits many valid implementations that differ in radix choice, stage ordering, and register-blocking strategy. These alternatives use different SIMD instruction mixes with different latencies, yet produce the same mathematical result. We show that finding the fastest implementation is a shortest-path problem on a directed acyclic graph. We formalize two variants of this graph. In the context-free model, nodes represent computation stages and edge weights are independently measured instruction costs. In the context-aware model, nodes are expanded to encode the predecessor edge type, so that edge weights capture inter-operation correlations such as cache warming: the cost of operation B depends on which operation A preceded it. This addresses a limitation identified but deliberately bypassed by FFTW (Frigo and Johnson, 1998): that…
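The two graph models described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the radix set, the cost tables (`BASE_COST`, `CONTEXT_DELTA`), and the function names are all hypothetical, and the numbers are invented rather than measured. Nodes are pairs of (log2 stages completed, predecessor edge type); in the context-free variant the predecessor is dropped, so each edge cost depends only on the pass itself.

```python
import heapq

# All numbers below are hypothetical, chosen only to illustrate the idea;
# they are not measurements from the paper.
RADIX_STAGES = {"R2": 1, "R4": 2, "R8": 3}       # log2 stages consumed per pass
BASE_COST = {"R2": 6.0, "R4": 9.0, "R8": 14.0}   # context-free cost per pass
# Adjustment applied to `cur` when it directly follows `prev`
# (negative = cheaper, e.g. data left warm in cache by the previous pass).
CONTEXT_DELTA = {("R8", "R2"): -3.0, ("R4", "R4"): 1.0}


def edge_cost(prev, cur, context_aware):
    """Cost of running pass `cur` after pass `prev` (prev is None at the start)."""
    cost = BASE_COST[cur]
    if context_aware and prev is not None:
        cost += CONTEXT_DELTA.get((prev, cur), 0.0)
    return cost


def shortest_plan(total_stages, context_aware):
    """Dijkstra over nodes (stages_done, predecessor_edge_type).

    Returns (total_cost, list_of_passes) for an FFT of size 2**total_stages.
    """
    start = (0, None)
    dist = {start: 0.0}
    parent = {}
    heap = [(0.0, 0, start)]   # (distance, tie-breaker, node)
    tie = 1
    while heap:
        d, _, node = heapq.heappop(heap)
        if d > dist[node]:
            continue           # stale heap entry
        done, prev = node
        if done == total_stages:
            # Walk parent links back to the start to recover the plan.
            plan = []
            while node != start:
                node, radix = parent[node]
                plan.append(radix)
            return d, plan[::-1]
        for radix, k in RADIX_STAGES.items():
            if done + k > total_stages:
                continue
            # The context-aware graph remembers which pass led to this node.
            nxt = (done + k, radix if context_aware else None)
            nd = d + edge_cost(prev, radix, context_aware)
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                parent[nxt] = (node, radix)
                heapq.heappush(heap, (nd, tie, nxt))
                tie += 1
    raise ValueError("no plan covers the requested number of stages")


if __name__ == "__main__":
    for aware in (False, True):
        cost, plan = shortest_plan(10, context_aware=aware)   # N = 1024
        print("context-aware" if aware else "context-free", cost, plan)
```

With these made-up costs, the context-free search for a 1024-point transform picks five radix-4 passes at a total cost of 45.0, while the context-aware search finds a plan of cost 43.0 that inserts radix-2 passes directly after radix-8 passes. Dijkstra is valid here because every adjusted edge cost stays positive; since the graph is acyclic, a dynamic-programming sweep in order of stages completed would work equally well.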
