TLX: Hardware-Native, Evolvable MIMW GPU Compiler for Large-scale Production Environments
Yue Guan, Hongtao Yu, Peng Chen, Daohang Shi, Karthik Manivannan, Nicholas J Riasanovsky, Manman Ren, Lei Wang, Shane Nay, Partha Kanuparthy, Zaifeng Pan, Zhengding Hu, and Yufei Ding

TL;DR
TLX is a GPU compiler extension that enables hardware-native, evolvable orchestration of multi-warp execution and data movement, improving performance and flexibility in large-scale environments.
Contribution
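Introduces TLX, an embedded extension to Triton for explicit multi-warp orchestration at warp-group granularity. It exposes hardware orchestration mechanisms to the programmer while preserving Triton's blocked programming model for regular computation.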
Findings
Supports substantial customization with limited development effort.
Remains competitive with state-of-the-art implementations.
Deployed in large-scale training and inference systems.
Abstract
Modern GPUs increasingly rely on specialized hardware units and asynchronous coordination mechanisms, so performance depends on orchestrating data movement, tensor-core computation, and synchronization rather than exposing more thread-level parallelism. This creates a programming-model tension: if too much execution structure is hidden, the compiler must catch up to new hardware mechanisms; if too much is exposed, the burden of orchestration falls back onto the programmer. We present TLX (Triton Low-level Language Extensions), built around MIMW (Multi-Instruction, Multi-Warp), which expresses orchestration at warp-group granularity while preserving Triton's productive blocked programming model for regular computation. TLX realizes this idea as an embedded extension to Triton, exposing explicit interfaces for multi-warp execution, local-memory orchestration, asynchronous operations, and…
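The orchestration pattern the abstract describes, where different warp groups take different roles and coordinate through local memory and synchronization, can be illustrated with a CPU-side analogy. The sketch below is not TLX or Triton code and does not use any of their APIs. It uses Python threads as stand-ins for two warp groups (a producer that stages tiles and a consumer that computes on them) and semaphores as stand-ins for hardware barriers. All names (`NUM_SLOTS`, `NUM_TILES`, `TILE`, `load_tile`, `run_pipeline`) are hypothetical and chosen only for illustration.

```python
import threading

NUM_SLOTS = 2   # depth of the staging ring buffer (hypothetical value)
NUM_TILES = 8   # number of tiles to process (hypothetical value)
TILE = 4        # elements per tile (hypothetical value)


def load_tile(i):
    """Stand-in for an asynchronous copy of tile i into local memory."""
    return [i * TILE + j for j in range(TILE)]


def run_pipeline():
    slots = [None] * NUM_SLOTS
    # One "empty" and one "full" signal per slot, playing the role of barriers.
    empty = [threading.Semaphore(1) for _ in range(NUM_SLOTS)]
    full = [threading.Semaphore(0) for _ in range(NUM_SLOTS)]
    results = []

    def producer():
        # Role 1: data movement only. Waits for a free slot, fills it, signals.
        for i in range(NUM_TILES):
            s = i % NUM_SLOTS
            empty[s].acquire()
            slots[s] = load_tile(i)
            full[s].release()

    def consumer():
        # Role 2: computation only. Waits for a filled slot, reduces it, frees it.
        for i in range(NUM_TILES):
            s = i % NUM_SLOTS
            full[s].acquire()
            results.append(sum(x * x for x in slots[s]))
            empty[s].release()

    workers = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    print(run_pipeline())
```

With a ring depth of two, the producer can stage the next tile while the consumer is still working on the current one, which is the kind of overlap between data movement and computation that explicit orchestration is meant to make expressible. On a GPU the roles would be warp groups and the signals would be hardware synchronization primitives; that mapping is an assumption of this analogy, not a description of the TLX interface.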
