FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness
Vincent Abbott, Gioele Zardini

TL;DR
This paper introduces a diagrammatic, resource-aware approach to optimizing deep learning algorithms on GPUs. It enables systematic derivation of high-level optimization strategies and a clearer understanding of techniques like FlashAttention.
Contribution
It extends Neural Circuit Diagrams to include resource usage and task distribution, facilitating hardware-aware optimization and analysis of deep learning algorithms.
Findings
Simple relabellings of the diagrams can be used to derive high-level streaming and tiling strategies.
High-level performance models incorporate quantization and GPU hierarchy effects.
Methodology enhances understanding of existing techniques like FlashAttention.
Abstract
Optimizing deep learning algorithms currently requires slow, manual derivation, potentially leaving much performance untapped. Methods like FlashAttention have achieved a 6x performance improvement over native PyTorch by avoiding unnecessary data transfers, but required three iterations over three years to be developed. Automated compiled methods have consistently lagged behind. This paper extends Neural Circuit Diagrams for deep learning models to consider resource usage and the distribution of tasks across a GPU hierarchy. We show how diagrams can use simple relabellings to derive high-level streaming and tiling optimization strategies along with performance models. We show how this high-level performance model allows the effects of quantization and multi-level GPU hierarchies to be readily considered. We develop a methodology for representing intermediate-level pseudocode with…
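To make "streaming" concrete, here is a minimal pure-Python sketch of the streamed (online) softmax commonly associated with FlashAttention, for a single query vector. It is an illustration under stated assumptions and not the paper's diagrammatic derivation or its performance model: the function names, the `tile` parameter, and the single-query simplification are all hypothetical choices made for this example. The point is that keys and values can be consumed one tile at a time, keeping only a running maximum, a running normalizer, and an output accumulator, so the full vector of attention scores never has to be held at once.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference version: materialize every score, then softmax, then mix values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def streamed_attention(q, keys, values, tile=2):
    """Streamed version: process `tile` keys/values at a time (hypothetical API).

    Only three pieces of state persist between tiles: the running maximum
    score, the running softmax normalizer, and the unnormalized output.
    """
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim

    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [dot(q, k) * scale for k in k_tile]

        # Rescale earlier partial results to the new running maximum.
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # 0.0 on the first tile
        weights = [math.exp(s - new_max) for s in scores]

        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_tile))
               for j, a in enumerate(acc)]
        running_max = new_max

    return [a / running_sum for a in acc]
```

Both functions compute the same result up to floating-point rounding; the streamed one simply never needs more than `tile` scores in working memory at a time, which is the property that lets a GPU kernel keep intermediate data in fast on-chip memory.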
Taxonomy
Topics: Robotics and Automated Systems
