Hydraulis: Balancing Large Transformer Model Training via Co-designing Parallel Strategies and Data Assignment
Haoyang Li, Fangcheng Fu, Sheng Lin, Hao Ge, Xuanyu Wang, Jiawen Niu, Jinbao Xue, Yangyu Tao, Di Wang, Jie Jiang, Bin Cui

TL;DR
Hydraulis is a system that improves large Transformer training efficiency by co-optimizing parallel strategies and data assignment to address workload imbalances caused by data sampling and packing issues.
Contribution
It introduces a dynamic heterogeneous parallel strategy and a two-stage data assignment approach to balance training workloads in large Transformer models.
Findings
Hydraulis outperforms existing systems by 1.32-2.66 times.
It effectively mitigates data sampling and data packing imbalances.
It enhances training efficiency for large Transformer models.
Abstract
To optimize large Transformer model training, both efficient parallel computing and advanced data management are indispensable. However, current methods often assume a stable and uniform training workload, neglecting data-induced imbalances (arising from both the sampling and packing processes), which can impede training performance. Specifically, data sampling imbalance arises from the uneven sequence length distribution of the training data, while data packing imbalance stems from the discrepancy between the linear memory complexity and the quadratic time complexity of the attention mechanism. To address these imbalance issues, we develop Hydraulis, which jointly optimizes the parallel strategies and data assignment. For one thing, we introduce large model training with dynamic heterogeneous parallel strategies in response to the sequence length variations within and across training iterations. For…
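As a rough illustration of the packing imbalance described in the abstract, the sketch below compares two packs that hold the same number of tokens. The function names and sequence lengths are hypothetical. The sketch also assumes attention is computed separately for each sequence inside a pack, so that time scales with the sum of squared lengths while memory scales with the sum of lengths. It is a toy cost proxy, not Hydraulis's actual cost model.

```python
def memory_cost(pack):
    """Proxy for activation memory: linear in the total number of tokens."""
    return sum(pack)


def attention_cost(pack):
    """Proxy for attention time: quadratic in each sequence's length.

    Assumes sequences in a pack do not attend to each other.
    """
    return sum(n * n for n in pack)


# Two hypothetical packs filling the same 4096-token budget.
pack_long = [4096]                     # one long sequence
pack_short = [1024, 1024, 1024, 1024]  # four short sequences

for name, pack in [("long", pack_long), ("short", pack_short)]:
    print(name, "tokens:", memory_cost(pack), "attention:", attention_cost(pack))

# Ratio of attention work between two packs of equal memory footprint.
ratio = attention_cost(pack_long) / attention_cost(pack_short)
print("attention cost ratio:", ratio)
```

Under these assumptions the two packs look identical to a packer that balances only token counts, yet the pack holding a single long sequence carries four times the attention work. That gap is the kind of imbalance the abstract attributes to packing.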
Taxonomy
Topics: Power Line Inspection Robots · Advanced Neural Network Applications · Oil and Gas Production Techniques
Methods: Attention Is All You Need · Adam · Dropout · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Label Smoothing
