TLX: Hardware-Native, Evolvable MIMW GPU Compiler for Large-scale Production Environments

Yue Guan, Hongtao Yu, Peng Chen, Daohang Shi, Karthik Manivannan, Nicholas J Riasanovsky, Manman Ren, Lei Wang, Shane Nay, Partha Kanuparthy, Zaifeng Pan, Zhengding Hu, and Yufei Ding

arXiv:2605.10905·cs.AR·May 15, 2026

TL;DR

TLX is a GPU compiler extension that enables hardware-native, evolvable orchestration of multi-warp execution and data movement, improving performance and flexibility in large-scale environments.

Contribution

Introduces TLX, an embedded extension to Triton that exposes explicit multi-warp orchestration while preserving Triton's blocked programming model for regular computation.

Findings

1. Supports substantial customization with limited development effort.

2. Remains competitive with state-of-the-art implementations.

3. Deployed in large-scale training and inference systems.

Abstract

Modern GPUs increasingly rely on specialized hardware units and asynchronous coordination mechanisms, so performance depends on orchestrating data movement, tensor-core computation, and synchronization rather than exposing more thread-level parallelism. This creates a programming-model tension: if too much execution structure is hidden, the compiler must catch up to new hardware mechanisms; if too much is exposed, the burden of orchestration falls back onto the programmer. We present TLX (Triton Low-level Language Extensions), built around MIMW (Multi-Instruction, Multi-Warp), which expresses orchestration at warp-group granularity while preserving Triton's productive blocked programming model for regular computation. TLX realizes this idea as an embedded extension to Triton, exposing explicit interfaces for multi-warp execution, local-memory orchestration, asynchronous operations, and…

Code & Models

Repositories

facebookexperimental/triton (GitHub)
