TiledAttention: a CUDA Tile SDPA Kernel for PyTorch
Taimur Khan

TL;DR
TiledAttention introduces a flexible, modifiable CUDA kernel for scaled dot-product attention in PyTorch, enabling rapid research and benchmarking with competitive performance on NVIDIA GPUs.
Contribution
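The paper presents a cuTile Python (TileIR) implementation of the SDPA forward pass that is callable from PyTorch and modifiable at the schedule level, so kernel research does not require template-heavy CUDA/CUTLASS rewrites. The algorithm itself follows the established FlashAttention-style online-softmax formulation.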
Findings
TiledAttention achieves significant speedups over eager attention implementations.
It offers a customizable kernel with performance comparable to production fused baselines.
The implementation is accessible within PyTorch, supporting rapid experimentation.
Abstract
TiledAttention is a scaled dot-product attention (SDPA) forward operator intended for SDPA research on NVIDIA GPUs. Implemented in cuTile Python (TileIR) and exposed as a PyTorch-callable function, it is easier to modify than low-level CUDA templates while retaining realistic behavior via online softmax and tiled streaming. Algorithmically, TiledAttention follows the established FlashAttention-style online-softmax formulation; our novelty is the cuTile/TileIR implementation strategy, schedule-level modifiability, and reproducible benchmarking/profiling workflow. The approach is both performant and directly editable at the schedule level from Python (tile shapes, staging, shared-memory layout), enabling rapid, reproducible kernel research without template-heavy CUDA/CUTLASS rewrites. We benchmark TiledAttention on an NVIDIA DGX GB10 node with a reproducible harness and compare against…
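To make the "online softmax and tiled streaming" idea in the abstract concrete, here is a minimal pure-Python sketch. It is not the paper's cuTile kernel and uses no GPU or cuTile API; the function names `naive_attention` and `tiled_attention` and the `tile_kv` parameter are hypothetical, chosen only for illustration. It shows how a FlashAttention-style pass can visit keys and values one tile at a time while keeping a running maximum and running sum per query row, and still reproduce the result of the full softmax.

```python
import math


def naive_attention(q, k, v):
    """Reference SDPA: softmax(q k^T / sqrt(d)) v, materializing every score."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        row_max = max(scores)
        weights = [math.exp(s - row_max) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v)) / total
            for c in range(len(v[0]))
        ])
    return out


def tiled_attention(q, k, v, tile_kv=2):
    """Streaming SDPA (illustrative): process K/V in tiles of `tile_kv` rows.

    Only one tile of scores exists at a time. A running max and running sum
    per query row let earlier partial results be rescaled when a larger
    score appears in a later tile.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for qi in q:
        running_max = -math.inf
        running_sum = 0.0
        acc = [0.0] * dv  # unnormalized weighted sum of value rows
        for start in range(0, len(k), tile_kv):
            k_tile = k[start:start + tile_kv]
            v_tile = v[start:start + tile_kv]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_tile]
            new_max = max(running_max, max(scores))
            # Rescale everything accumulated so far to the new maximum.
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [
                a * correction + sum(w * vj[c] for w, vj in zip(weights, v_tile))
                for c, a in enumerate(acc)
            ]
            running_max = new_max
        # Normalize once, after the last tile.
        out.append([a / running_sum for a in acc])
    return out
```

In this sketch `tile_kv` plays the role of a schedule-level knob: changing it alters how the work is streamed but not the numerical result (up to floating-point rounding), which is the property that makes tile shapes safe to tune in a real kernel.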
