Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10
Yifan Zhu, Yekai Pan, Chen Ding

TL;DR
This paper introduces Sawtooth Wavefront Reordering, a technique that cuts L2 cache misses by 50% or more and raises throughput by up to 60% for CuTile FlashAttention on NVIDIA GB10, a kernel central to large language model performance.
Contribution
It presents a novel cache optimization method, Sawtooth Wavefront Reordering, improving CuTile FlashAttention efficiency on NVIDIA GB10 by reducing cache misses.
Findings
50% or greater reduction in L2 cache misses
Up to 60% increase in throughput
Effective on both CUDA and CuTile implementations
Abstract
High-performance attention kernels are essential for Large Language Models. This paper presents an analysis of the memory behavior of CuTile-based Flash Attention and a technique to improve its cache performance. In particular, our analysis on the NVIDIA GB10 (Grace Blackwell) identifies the main cause of L2 cache misses. Leveraging this insight, we introduce a new programming technique called Sawtooth Wavefront Reordering that reduces L2 misses. We validate it in both CUDA and CuTile, observing a 50% or greater reduction in L2 misses and up to a 60% increase in throughput on GB10.
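The abstract names the technique but does not spell out the reordering. As a rough illustration only, the sketch below assumes that "sawtooth" means alternating the direction in which a fixed set of tiles is traversed on successive passes, and it models the L2 as a small fully associative LRU cache. The function names, tile counts, and cache size are all hypothetical. This is not the authors' kernel; it is a toy model of why reversing direction can raise reuse.

```python
from collections import OrderedDict


def lru_misses(accesses, capacity):
    """Count misses for a sequence of tile ids under a toy LRU cache."""
    cache = OrderedDict()
    misses = 0
    for tile in accesses:
        if tile in cache:
            cache.move_to_end(tile)  # mark as most recently used
        else:
            misses += 1
            cache[tile] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return misses


def cyclic_order(num_tiles, num_passes):
    """Every pass walks the tiles in the same direction: 0, 1, ..., n-1."""
    return [t for _ in range(num_passes) for t in range(num_tiles)]


def sawtooth_order(num_tiles, num_passes):
    """Assumed sawtooth order: odd passes walk the tiles in reverse."""
    order = []
    for p in range(num_passes):
        if p % 2 == 0:
            order.extend(range(num_tiles))
        else:
            order.extend(range(num_tiles - 1, -1, -1))
    return order


# Hypothetical toy sizes: 8 tiles, 4 passes, cache holds 4 tiles.
NUM_TILES, NUM_PASSES, CAPACITY = 8, 4, 4
cyclic_misses = lru_misses(cyclic_order(NUM_TILES, NUM_PASSES), CAPACITY)
sawtooth_misses = lru_misses(sawtooth_order(NUM_TILES, NUM_PASSES), CAPACITY)
print(cyclic_misses, sawtooth_misses)
```

With these toy parameters the same-direction order misses on all 32 accesses, while the alternating order misses on 20, because each reversed pass starts with the tiles the previous pass left in the cache. The numbers come from this toy model and say nothing about the measured GB10 results.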
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Big Data and Digital Economy
