ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels
Stuart H. Sul, Simran Arora, Benjamin F. Spector, Christopher Ré

TL;DR
ParallelKittens introduces a small, reusable set of design principles and a minimal CUDA framework that simplify the development of overlapped multi-GPU kernels and improve their performance across diverse workloads.
Contribution
The paper presents ParallelKittens (PK), a minimal CUDA framework that extends ThunderKittens with eight core primitives and a unified programming template, enabling systematic and practical optimization of multi-GPU kernels.
Findings
Achieves up to 2.33x speedup on Hopper and Blackwell architectures.
Reduces development complexity with fewer than 50 lines of device code.
Demonstrates significant performance improvements across various parallel workloads.
Abstract
Inter-GPU communication has become a major bottleneck for modern AI workloads as models scale and improvements in hardware compute throughput outpace improvements in interconnect bandwidth. Existing systems mitigate this through compute-communication overlap but often fail to meet theoretical peak performance across heterogeneous workloads and new accelerators. Instead of operator-specific techniques, we ask whether a small set of simple, reusable principles can systematically guide the design of optimal multi-GPU kernels. We present ParallelKittens (PK), a minimal CUDA framework that drastically simplifies the development of overlapped multi-GPU kernels. PK extends the ThunderKittens framework and embodies the principles of multi-GPU kernel design through eight core primitives and a unified programming template, derived from a comprehensive analysis of the factors that govern multi-GPU…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Advanced Neural Network Applications
