Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs
Divakar Kumar Yadav, Tian Zhao, Deepak Kumar

TL;DR
This paper evaluates NVIDIA's CUDA Tile (CuTile) abstraction on Hopper and Blackwell GPUs across GEMM, fused attention, and LLM inference workloads, highlighting its performance benefits and portability limitations relative to cuBLAS, Triton, WMMA, and raw SIMT.
Contribution
It provides the first independent, cross-architecture performance evaluation of CuTile on Hopper and Blackwell GPUs for AI workloads.
Findings
CuTile achieves up to 1007 TFLOP/s for fused attention on datacenter-class Blackwell (B200).
CuTile reaches 52-79% of cuBLAS performance for GEMM with minimal code.
Triton demonstrates more consistent portability, maintaining 62-101% of cuBLAS performance.
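The throughput and percent-of-cuBLAS figures above follow from simple arithmetic, sketched below under stated assumptions. The sketch uses the common convention of counting 2·M·N·K floating-point operations for a GEMM; the paper's exact measurement methodology is not given in this summary, and the matrix size and timings here are hypothetical placeholders, not results from the paper.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Operation count for C = A @ B with A of shape (m, k) and B of shape (k, n).

    Assumes the usual convention of one multiply plus one add per inner-product term.
    """
    return 2 * m * n * k


def tflops(flops: float, seconds: float) -> float:
    """Convert an operation count and a wall-clock time into TFLOP/s."""
    return flops / seconds / 1e12


def percent_of_baseline(kernel_tflops: float, baseline_tflops: float) -> float:
    """Express a kernel's throughput as a percentage of a baseline's throughput."""
    return 100.0 * kernel_tflops / baseline_tflops


# Hypothetical problem size and timings, chosen only to illustrate the arithmetic.
m = n = k = 8192
flops = gemm_flops(m, n, k)
baseline_tflops = tflops(flops, 2.0e-3)   # assumed baseline (e.g. vendor library) time
candidate_tflops = tflops(flops, 3.0e-3)  # assumed time for the kernel under test
ratio = percent_of_baseline(candidate_tflops, baseline_tflops)

print(f"baseline:  {baseline_tflops:.1f} TFLOP/s")
print(f"candidate: {candidate_tflops:.1f} TFLOP/s")
print(f"candidate reaches {ratio:.1f}% of baseline")
```

Because both kernels perform the same number of operations, the percentage reduces to the ratio of the two run times; the operation count matters only for the absolute TFLOP/s figure.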
Abstract
NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition. We benchmark representative AI workloads, including GEMM, fused multi-head attention, and end-to-end LLM inference in BF16/FP16 precision, to assess both performance and portability. Our results show that CuTile's effectiveness depends strongly on workload and architecture. On datacenter-class Blackwell (B200), CuTile achieves up to 1007 TFLOP/s for fused attention, outperforming FlashAttention-2 by…
