DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

Size Zheng; Xuegui Zheng; Hanshi Sun; Qi Hou; Wenlei Bao; Shiyu Li; Haojie Duanmu; Jin Fang; Chenli Xue; Chenhui Huang; Yuanqiang Liu; Renze Chen; Ningxin Zheng; Dongyang Wang; Li-Wen Chang; Liqiang Lu; Yun Liang; Jidong Zhai; Xin Liu

arXiv:2605.02953·cs.PL·May 6, 2026

TL;DR

DITRON is a distributed tensor compiler that introduces a hierarchical abstraction to optimize large language model performance across heterogeneous hardware. It reports kernel and inference speedups and has been deployed at enterprise scale.

Contribution

The paper presents a multi-level tiling compiler whose hierarchical abstraction, spanning Core, Device, and Task levels, supports diverse parallelism strategies for distributed tensor programs.

Findings

01

Achieves 6-30% speedup on kernels and 5-30% on inference.

02

Demonstrates portability across NVIDIA and AMD hardware.

03

Deployed at enterprise scale, saving GPU hours and improving training and inference efficiency.

Abstract

The scaling of large language models (LLMs) is currently bottlenecked by the rigidity of distributed programming. While high-performance libraries like cuBLAS and NCCL provide optimized primitives, they lack the flexibility required for rapidly evolving model architectures. Conversely, existing tensor compilers fail to address the complex memory hierarchy of distributed clusters effectively. To bridge this gap, we propose DITRON, a scalable tile-level compiler that democratizes high-performance distributed kernel development. DITRON introduces a novel hierarchical programming abstraction spanning Core, Device, and Task levels to map tensor programs efficiently onto heterogeneous distributed hardware. This abstraction allows DITRON to support diverse parallelism strategies while abstracting away the complexity of inter-node and intra-node communication. Evaluated across large-scale…
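
To make the idea of multi-level tiling concrete, here is a minimal sketch in plain Python. It is not DITRON's API. The level names come from the abstract, but the nesting order (task, then device, then core), the helper names `split` and `tiled_matmul`, and the parameters `tasks`, `devices`, and `core_tile` are all hypothetical, chosen only to show how one matrix multiplication can be decomposed into nested tiles.

```python
# Illustrative sketch only. The level names are taken from the abstract;
# the nesting order and every function/parameter name are assumptions,
# not DITRON's actual interface.

def split(start, stop, parts):
    """Split [start, stop) into `parts` contiguous, near-equal ranges."""
    size, rem = divmod(stop - start, parts)
    ranges, lo = [], start
    for i in range(parts):
        hi = lo + size + (1 if i < rem else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def tiled_matmul(a, b, tasks=2, devices=2, core_tile=2):
    """Compute a @ b by tiling the output at three nested levels."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    # Assumed "task" level: coarse row blocks of the output.
    for t_lo, t_hi in split(0, m, tasks):
        # Assumed "device" level: each task's rows are divided among devices.
        for d_lo, d_hi in split(t_lo, t_hi, devices):
            # Assumed "core" level: small square tiles within a device's rows.
            for r0 in range(d_lo, d_hi, core_tile):
                for c0 in range(0, n, core_tile):
                    for i in range(r0, min(r0 + core_tile, d_hi)):
                        for j in range(c0, min(c0 + core_tile, n)):
                            c[i][j] = sum(a[i][p] * b[p][j] for p in range(k))
    return c
```

Each output element is written exactly once, so the result matches an untiled matrix multiplication regardless of the tile counts chosen. A real distributed compiler would additionally place the tiles on hardware and generate the communication between them, which this sketch omits entirely.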
