Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Divakar Kumar Yadav; Tian Zhao; Deepak Kumar

arXiv:2604.23466 · cs.LG · April 28, 2026

TL;DR

This paper evaluates NVIDIA's CUDA Tile (CuTile) abstraction against cuBLAS, Triton, WMMA, and raw SIMT on Hopper and Blackwell GPUs, showing that its performance benefits and portability depend strongly on workload and architecture.

Contribution

The paper provides the first independent, cross-architecture performance evaluation of CuTile for AI workloads on three Hopper and Blackwell GPUs: H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition.

Findings

01

CuTile achieves up to 1007 TFLOP/s for fused attention on the datacenter-class Blackwell B200.

02

CuTile reaches 52-79% of cuBLAS performance for GEMM with minimal code.

03

Triton demonstrates more consistent portability, maintaining 62-101% of cuBLAS performance.

Abstract

NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition. We benchmark representative AI workloads, including GEMM, fused multi-head attention, and end-to-end LLM inference in BF16/FP16 precision, to assess both performance and portability. Our results show that CuTile effectiveness is strongly workload- and architecture-dependent. On datacenter-class Blackwell (B200), CuTile achieves up to 1007 TFLOP/s for fused attention, outperforming FlashAttention-2 by…
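
The throughput and percent-of-cuBLAS figures quoted above are derived quantities. As a rough illustration of how such numbers are typically computed for GEMM, the sketch below assumes the common convention of counting 2·M·N·K floating-point operations for an (M×K)·(K×N) product. The excerpt does not state the paper's exact FLOP-counting or timing methodology, so the function names, problem size, and timings here are hypothetical and do not reproduce any measured result.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in TFLOP/s for an (m x k) @ (k x n) matrix product.

    Assumes the usual convention of 2*m*n*k FLOPs (one multiply and one
    add per inner-product term). This is an assumption, not necessarily
    the paper's exact accounting.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * m * n * k / seconds / 1e12


def percent_of_baseline(candidate_tflops: float, baseline_tflops: float) -> float:
    """Express a kernel's throughput as a percentage of a baseline's."""
    if baseline_tflops <= 0:
        raise ValueError("baseline_tflops must be positive")
    return 100.0 * candidate_tflops / baseline_tflops


if __name__ == "__main__":
    # Hypothetical problem size and timings, chosen only for illustration.
    m = n = k = 8192
    candidate = gemm_tflops(m, n, k, seconds=2.0e-3)  # illustrative kernel time
    baseline = gemm_tflops(m, n, k, seconds=1.5e-3)   # illustrative baseline time
    print(f"candidate: {candidate:.1f} TFLOP/s")
    print(f"baseline:  {baseline:.1f} TFLOP/s")
    print(f"relative:  {percent_of_baseline(candidate, baseline):.1f}% of baseline")
```

Because both kernels solve the same problem size, the relative figure reduces to the ratio of their runtimes; the absolute TFLOP/s value matters mainly when comparing against a device's peak rate.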
