TiledAttention: a CUDA Tile SDPA Kernel for PyTorch

Taimur Khan

arXiv:2603.01960 · cs.LG · May 12, 2026

TL;DR

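TiledAttention is a scaled dot-product attention (SDPA) forward kernel written in cuTile Python and callable from PyTorch. Its schedule (tile shapes, staging, shared-memory layout) can be edited directly from Python, which supports rapid kernel research and reproducible benchmarking on NVIDIA GPUs.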

Contribution

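The paper follows the established FlashAttention-style online-softmax algorithm; its contribution is a cuTile/TileIR implementation of SDPA that is modifiable at the schedule level from Python, together with a reproducible benchmarking and profiling workflow, so that kernel experiments do not require template-heavy CUDA/CUTLASS rewrites.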

Findings

1. TiledAttention achieves significant speedups over eager attention implementations.

2. It offers a customizable kernel with performance comparable to production fused baselines.

3. The implementation is accessible within PyTorch, supporting rapid experimentation.

Abstract

TiledAttention is a scaled dot-product attention (SDPA) forward operator for SDPA research on NVIDIA GPUs. Implemented in cuTile Python (TileIR) and exposed as a PyTorch-callable function, it is easier to modify than low-level CUDA templates while retaining realistic behavior via online softmax and tiled $K, V$ streaming. Algorithmically, TiledAttention follows the established FlashAttention-style online-softmax formulation; our novelty is the cuTile/TileIR implementation strategy, schedule-level modifiability, and reproducible benchmarking/profiling workflow. The approach is both performant and directly editable at the schedule level from Python (tile shapes, staging, shared-memory layout), enabling rapid, reproducible kernel research without template-heavy CUDA/CUTLASS rewrites. We benchmark TiledAttention on an NVIDIA DGX GB10 node with a reproducible harness and compare against…
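
The online-softmax recurrence with tiled $K, V$ streaming that the abstract refers to can be sketched in plain Python. This is only an illustrative reference, not the paper's cuTile kernel: the function names `naive_attention` and `tiled_attention`, the `tile` parameter, and the list-of-lists layout are assumptions made for this sketch, and no cuTile or PyTorch API is used.

```python
import math


def naive_attention(q, k, v):
    """Reference SDPA: softmax(q k^T / sqrt(d)) v, with all scores materialized."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z
                    for c in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, tile=2):
    """Hypothetical sketch: stream K and V in tiles and keep only running
    softmax statistics per query row (online softmax)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for qi in q:
        m = -math.inf        # running maximum of the scores seen so far
        l = 0.0              # running sum of exp(score - m)
        acc = [0.0] * dv     # running unnormalized output row
        for start in range(0, len(k), tile):
            k_tile = k[start:start + tile]
            v_tile = v[start:start + tile]
            scores = [scale * sum(a * b for a, b in zip(qi, kj))
                      for kj in k_tile]
            m_new = max(m, max(scores))
            # Rescale earlier partial results to the new maximum.
            # On the first tile m is -inf, so alpha is exactly 0.0.
            alpha = math.exp(m - m_new)
            p = [math.exp(s - m_new) for s in scores]
            l = alpha * l + sum(p)
            acc = [alpha * a + sum(pj * vj[c] for pj, vj in zip(p, v_tile))
                   for c, a in enumerate(acc)]
            m = m_new
        # Normalize once, after the last tile.
        out.append([a / l for a in acc])
    return out
```

In this sketch the tile size only changes the order of accumulation, so both functions should agree up to floating-point rounding for any `tile >= 1`. In a GPU kernel the analogous choice determines how much of $K$ and $V$ is staged on chip at a time.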
