Optimizing DNN Compilation for Distributed Training with Joint OP and Tensor Fusion
Xiaodong Yi, Shiwei Zhang, Lansong Diao, Chuan Wu, Zhen Zheng, Shiqing Fan, Siyu Wang, Jun Yang, Wei Lin

TL;DR
DisCo is an automatic compilation module that optimizes data-parallel deep neural network training across multiple GPU machines by jointly fusing computation operators and communication tensors, reducing per-iteration training time.
Contribution
DisCo introduces a GNN-based simulator that estimates per-iteration training time, and a backtracking search driven by that simulator, to find joint operator and tensor fusion strategies for distributed DNN training.
Findings
Achieves near-ideal training speed-up with operator/tensor fusion
Outperforms existing fusion schemes in distributed training scenarios
Effectively minimizes communication overhead during training
Abstract
This paper proposes DisCo, an automatic deep learning compilation module for data-parallel distributed training. Unlike most deep learning compilers that focus on training or inference on a single device, DisCo optimizes a DNN model for distributed training over multiple GPU machines. Existing single-device compilation strategies do not work well in distributed training, due mainly to communication inefficiency that they incur. DisCo generates optimized, joint computation operator and communication tensor fusion strategies to enable highly efficient distributed training. A GNN-based simulator is built to effectively estimate per-iteration training time achieved by operator/tensor fusion candidates. A backtracking search algorithm is driven by the simulator, navigating efficiently in the large strategy space to identify good operator/tensor fusion strategies that minimize distributed…
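The idea of a search driven by a cost simulator can be illustrated with a small sketch. This is not DisCo's implementation: it covers tensor fusion only, and it swaps the GNN-based simulator for a hand-written analytic cost model. All names (`simulate`, `search_fusion`, `alpha`, `beta`) and the cost formula are assumptions made for illustration. Gradient tensors become ready at given times, each fused group costs a fixed latency plus a size-proportional term, and groups are sent one after another.

```python
from typing import List, Tuple

Group = Tuple[int, int]  # half-open index range [start, end) of fused tensors


def simulate(ready: List[float], sizes: List[float], groups: List[Group],
             alpha: float = 1.0, beta: float = 0.1) -> float:
    """Toy stand-in for a simulator (assumption, not the paper's GNN model).

    A fused group can start once its last tensor is ready and the previous
    group has finished; it then takes alpha + beta * (total size).
    Returns the time at which the last group finishes.
    """
    finish = 0.0
    for start_idx, end_idx in groups:
        start = max(finish, ready[end_idx - 1])
        finish = start + alpha + beta * sum(sizes[start_idx:end_idx])
    return finish


def search_fusion(ready: List[float], sizes: List[float],
                  alpha: float = 1.0, beta: float = 0.1):
    """Backtracking over contiguous groupings, pruned by the best time so far."""
    n = len(sizes)
    best = {"time": float("inf"), "groups": []}

    def backtrack(pos: int, groups: List[Group], finish: float) -> None:
        # The finish time never decreases as groups are appended, so a
        # partial plan that is already no better than the best can be cut.
        if finish >= best["time"]:
            return
        if pos == n:
            best["time"] = finish
            best["groups"] = list(groups)
            return
        for end in range(pos + 1, n + 1):
            start = max(finish, ready[end - 1])
            new_finish = start + alpha + beta * sum(sizes[pos:end])
            groups.append((pos, end))
            backtrack(end, groups, new_finish)
            groups.pop()  # undo the choice before trying the next one

    backtrack(0, [], 0.0)
    return best["groups"], best["time"]


# Hypothetical inputs: four equally sized gradients that are all ready at once.
example_groups, example_time = search_fusion([0.0, 0.0, 0.0, 0.0],
                                             [1.0, 1.0, 1.0, 1.0])
```

In this toy model, fusing everything wins when all tensors are ready together, while a tensor that becomes ready late is better sent on its own so that earlier groups can overlap with computation. The abstract describes a search over a much larger joint space of operator and tensor fusion, with the learned simulator supplying the time estimates.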
