Expediting Distributed DNN Training with Device Topology-Aware Graph Deployment
Shiwei Zhang, Xiaodong Yi, Lansong Diao, Chuan Wu, Siyu Wang, and Wei Lin

TL;DR
This paper introduces TAG, a system that speeds up distributed DNN training by jointly considering the DNN computation graph and the device topology. It reports up to 4.56x training speed-up over existing schemes and produces deployment strategies for heterogeneous clusters.
Contribution
The paper proposes a GNN-based approach, combined with a search-based method, that optimizes DNN training graphs and their deployment on heterogeneous device topologies. It also applies a lossless gradient compression technique to reduce communication.
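To make the idea of topology-aware deployment concrete, here is a minimal sketch in Python. It is not TAG's algorithm: it replaces the GNN-guided search with an exhaustive search, and the operator names, device names, costs, bandwidths, and the cost model itself are all hypothetical values chosen for illustration.

```python
from itertools import product

# Hypothetical toy computation graph (not from the paper):
# per-operator compute cost and edges annotated with tensor size.
op_cost = {"embed": 4.0, "attn": 8.0, "mlp": 8.0, "loss": 2.0}
edges = [("embed", "attn", 2.0), ("attn", "mlp", 2.0), ("mlp", "loss", 1.0)]

# Hypothetical heterogeneous device topology:
# per-device speed and pairwise link bandwidth.
device_speed = {"gpu0": 4.0, "gpu1": 4.0, "gpu2": 1.0}
bandwidth = {
    frozenset(("gpu0", "gpu1")): 8.0,  # fast intra-node link
    frozenset(("gpu0", "gpu2")): 0.5,  # slow inter-node link
    frozenset(("gpu1", "gpu2")): 0.5,  # slow inter-node link
}


def estimate_time(placement):
    """Crude toy cost: busiest device's compute time plus total transfer time."""
    load = {dev: 0.0 for dev in device_speed}
    for op, dev in placement.items():
        load[dev] += op_cost[op] / device_speed[dev]
    comm = 0.0
    for src, dst, size in edges:
        a, b = placement[src], placement[dst]
        if a != b:  # a tensor only crosses a link when the devices differ
            comm += size / bandwidth[frozenset((a, b))]
    return max(load.values()) + comm


def search_best_placement():
    """Exhaustively try every operator-to-device assignment (tiny graphs only)."""
    ops = list(op_cost)
    devices = list(device_speed)
    best, best_time = None, float("inf")
    for choice in product(devices, repeat=len(ops)):
        placement = dict(zip(ops, choice))
        t = estimate_time(placement)
        if t < best_time:
            best, best_time = placement, t
    return best, best_time


if __name__ == "__main__":
    placement, t = search_best_placement()
    print(placement, t)
```

Under these assumed numbers the search splits the operators across the two fast, well-connected devices and leaves the slow, poorly connected one idle, which is the kind of decision a topology-aware planner has to make. Exhaustive search grows exponentially with the number of operators, which is why a learned model or a guided search is needed at realistic scale.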
Findings
Achieves up to 4.56x training speed-up over existing schemes
Effective for unseen models and topologies
Reduces communication overhead through lossless gradient compression
Abstract
This paper presents TAG, an automatic system that derives an optimized DNN training graph and its deployment onto any device topology, for expedited training in device- and topology-heterogeneous ML clusters. We combine the DNN computation graph and the device topology graph as input to a graph neural network (GNN), and pair the GNN with a search-based method to quickly identify optimized distributed training strategies. To reduce communication in a heterogeneous cluster, we further explore a lossless gradient compression technique and solve a combinatorial optimization problem to apply the technique automatically for training time minimization. We evaluate TAG with various representative DNN models and device topologies, showing that it can achieve up to 4.56x training speed-up compared to existing schemes. TAG can produce efficient deployment strategies for both unseen DNN…
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Advanced Graph Neural Networks · IoT and Edge/Fog Computing
Methods: Graph Neural Network
