Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10

Yifan Zhu; Yekai Pan; Chen Ding

arXiv:2601.16032 · cs.PF · January 27, 2026

TL;DR

This paper introduces Sawtooth Wavefront Reordering, a programming technique that cuts L2 cache misses by 50% or more and raises throughput by up to 60% for CuTile FlashAttention on NVIDIA GB10, speeding up the attention kernels that large language models depend on.

Contribution

The paper analyzes the memory behavior of CuTile-based FlashAttention on NVIDIA GB10, identifies the main cause of its L2 cache misses, and presents Sawtooth Wavefront Reordering, a programming technique that reduces them.

Findings

01. 50% or greater reduction in L2 cache misses
02. Up to 60% increase in throughput
03. Effective on both CUDA and CuTile implementations

Abstract

High-performance attention kernels are essential for Large Language Models. This paper presents an analysis of the memory behavior of CuTile-based Flash Attention and a technique to improve its cache performance. In particular, our analysis on the NVIDIA GB10 (Grace Blackwell) identifies the main cause of L2 cache misses. Leveraging this insight, we introduce a new programming technique called Sawtooth Wavefront Reordering that reduces L2 misses. We validate it in both CUDA and CuTile, observing a 50% or greater reduction in L2 misses and up to a 60% increase in throughput on GB10.
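The abstract does not spell out the reordering itself. As a rough illustration only, the sketch below assumes that "sawtooth" means reversing the direction in which key/value blocks are traversed on every other pass, so that the blocks touched last are reused first. It compares that order against a plain cyclic order on a toy fully associative LRU cache. The cache model, the function names, and the sizes are all hypothetical; none of them come from the paper, and the real GB10 L2 is not a simple LRU cache.

```python
from collections import OrderedDict


def count_misses(trace, capacity):
    """Count misses of a toy fully associative LRU cache over a block trace."""
    cache = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return misses


def cyclic_trace(num_blocks, num_passes):
    """Every pass walks the blocks in the same order: 0, 1, ..., n-1."""
    return [b for _ in range(num_passes) for b in range(num_blocks)]


def sawtooth_trace(num_blocks, num_passes):
    """Assumed sawtooth order: odd-numbered passes walk the blocks in reverse."""
    trace = []
    for p in range(num_passes):
        if p % 2 == 0:
            trace.extend(range(num_blocks))
        else:
            trace.extend(range(num_blocks - 1, -1, -1))
    return trace


# Hypothetical sizes: the working set (8 blocks) is larger than the cache (6 blocks).
NUM_BLOCKS, NUM_PASSES, CAPACITY = 8, 4, 6

cyclic_misses = count_misses(cyclic_trace(NUM_BLOCKS, NUM_PASSES), CAPACITY)
sawtooth_misses = count_misses(sawtooth_trace(NUM_BLOCKS, NUM_PASSES), CAPACITY)
print(f"cyclic: {cyclic_misses} misses, sawtooth: {sawtooth_misses} misses")
```

In this toy model the cyclic order misses on all 32 accesses, because each block is evicted just before it is needed again, while the alternating order misses 14 times. These counts describe the toy cache only and say nothing about the measured GB10 results reported in the abstract.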

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Big Data and Digital Economy