End-to-end Adaptive Distributed Training on PaddlePaddle
Yulong Ao, Zhihua Wu, Dianhai Yu, Weibao Gong, Zhiqing Kui, Minxu Zhang, Zilingfeng Ye, Liang Shen, Yanjun Ma, Tian Wu, Haifeng Wang, Wei Zeng, Chao Yang

TL;DR
This paper presents an adaptive end-to-end distributed training framework on PaddlePaddle that efficiently handles diverse models and resources, ensuring high performance, fault tolerance, and elasticity for industrial-scale neural network training.
Contribution
It introduces a unified adaptive distributed training framework with global cost modeling and planning, enabling flexible parallelism, resource-aware placement, and fault tolerance in production environments.
Findings
Efficient training of the 260-billion-parameter ERNIE model with 91.7% weak scalability.
Throughput improvements of up to 2.1x and 3.3x over GPU-only and CPU-only training, respectively.
Reduced failed training jobs by 34.49% and increased scheduling efficiency by 33.91%.
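As a rough illustration of how a weak-scalability figure like the one above is usually computed, the sketch below divides measured throughput at a larger device count by the throughput that ideal linear scaling would give. The function name and all numbers are hypothetical and are not taken from the paper; the paper's exact measurement setup may differ.

```python
def weak_scaling_efficiency(base_throughput, base_devices,
                            scaled_throughput, scaled_devices):
    """Ratio of measured throughput to ideal linearly scaled throughput.

    In weak scaling the per-device workload is held fixed, so ideal
    throughput grows in proportion to the number of devices.
    """
    if base_devices <= 0 or scaled_devices <= 0:
        raise ValueError("device counts must be positive")
    ideal = base_throughput * (scaled_devices / base_devices)
    return scaled_throughput / ideal


# Hypothetical measurements (samples/s), chosen only for illustration.
efficiency = weak_scaling_efficiency(
    base_throughput=100.0, base_devices=8,
    scaled_throughput=720.0, scaled_devices=64,
)
print(f"weak scaling efficiency: {efficiency:.1%}")
```

An efficiency of 100% would mean throughput grows exactly in proportion to the device count; values below that reflect communication and synchronization overhead.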
Abstract
Distributed training has become a pervasive and effective approach for training large neural network (NN) models on massive data. However, it is very challenging to satisfy the requirements of various NN models, diverse computing resources, and their dynamic changes during a training job. In this study, we design our distributed training framework from a systematic end-to-end view to provide built-in adaptive ability for different scenarios, especially industrial applications and production environments, by fully considering resource allocation, model partition, task placement, and distributed execution. Based on the unified distributed graph and the unified cluster object, our adaptive framework is equipped with a global cost model and a global planner, which can enable arbitrary parallelism, resource-aware placement, multi-mode execution, fault-tolerant, and elastic…
Taxonomy
Topics: Advanced Neural Network Applications · IoT and Edge/Fog Computing · Cloud Computing and Resource Management
Methods: ERNIE
