HipKittens: Fast and Furious AMD Kernels
William Hu, Drew Wadsworth, Sean Siddens, Stanley Winata, Daniel Y. Fu, Ryann Swann, Muhammad Osama, Christopher Ré, Simran Arora

TL;DR
HipKittens (HK) is a framework for writing high-performance AI kernels on AMD GPUs using tile-based abstractions. Its kernels match or exceed hand-optimized assembly, and the work shows that tile-based abstractions from prior NVIDIA-focused DSLs carry over to AMD hardware.
Contribution
This work provides the first detailed study of tile-based programming primitives for AMD GPUs and encapsulates these insights in the HipKittens framework, showing that the tile-based abstractions of prior DSLs such as ThunderKittens generalize beyond NVIDIA hardware.
Findings
HK kernels match AMD's hand-optimized assembly kernels for GEMMs and attention.
HK outperforms compiler baselines in various AI workloads.
In some cases, HK exceeds all kernel baselines by 1.2-2.4×.
Abstract
AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak-performance AMD kernels are written in raw assembly. To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++-embedded and PyTorch-inspired domain-specific languages like ThunderKittens (TK) to simplify high-performance AI kernel development on NVIDIA hardware. We explore the extent to which such primitives -- for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers -- are NVIDIA-specific or general. We provide the first detailed study of the programming primitives that lead to performant AMD AI kernels, and we encapsulate these insights in the HipKittens (HK) programming framework. We find that tile-based abstractions used in prior DSLs generalize to AMD GPUs; however, we need to rethink the algorithms that…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Embedded Systems Design Techniques
