Balanced and Elastic End-to-end Training of Dynamic LLMs
Mohamed Wahib, Muhammed Abdullah Soyturk, Didem Unat

TL;DR
This paper introduces DynMo, an autonomous load-balancing method for training dynamic large language models. It reduces workload imbalance across workers and improves the efficiency of distributed training.
Contribution
DynMo is a load-balancing solution that provably achieves maximum reduction in workload imbalance. It adaptively equalizes compute loads across workers and consolidates computation, which improves the scalability of dynamic LLM training.
Findings
Achieves up to 4.52x speedup in training with dynamic LLM techniques.
Effectively balances workload across workers in distributed training.
Supports both multi-GPU and multi-node GPU clusters.
Abstract
To reduce the computational and memory overhead of Large Language Models, various approaches have been proposed. These include a) Mixture of Experts (MoEs), where token routing affects compute balance; b) gradual pruning of model parameters; c) dynamically freezing layers; d) dynamic sparse attention mechanisms; e) early exit of tokens as they pass through model layers; and f) Mixture of Depths (MoDs), where tokens bypass certain blocks. While these approaches are effective in reducing overall computation, they often introduce significant workload imbalance across workers. In many cases, this imbalance is severe enough to render the techniques impractical for large-scale distributed training, limiting their applicability to toy models due to poor efficiency. We propose an autonomous dynamic load balancing solution, DynMo, which provably achieves maximum reduction in workload imbalance…
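To make the imbalance problem concrete, the following is a minimal sketch of rebalancing layers across workers once per-layer costs have drifted, for example after pruning or freezing makes some layers cheaper. It is not DynMo's algorithm, which the excerpt above does not specify. It is a textbook dynamic program that splits layers into contiguous groups so that the most heavily loaded worker is as light as possible. The function name `balance_layers` and the cost values are hypothetical.

```python
from itertools import accumulate
from typing import List, Tuple


def balance_layers(costs: List[float], num_workers: int) -> Tuple[float, List[List[int]]]:
    """Split layers into `num_workers` contiguous, non-empty groups so that
    the largest per-worker load (the bottleneck) is minimized.

    Illustrative only: a classic linear-partition dynamic program,
    not the method proposed in the paper.
    """
    n = len(costs)
    if not 1 <= num_workers <= n:
        raise ValueError("need 1 <= num_workers <= number of layers")

    # prefix[i] is the total cost of layers 0..i-1
    prefix = [0] + list(accumulate(costs))
    inf = float("inf")

    # best[k][i]: smallest bottleneck when the first i layers go to k workers
    # cut[k][i]:  index where the last of those k groups starts
    best = [[inf] * (n + 1) for _ in range(num_workers + 1)]
    cut = [[0] * (n + 1) for _ in range(num_workers + 1)]
    best[0][0] = 0

    for k in range(1, num_workers + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                # the last worker takes layers j..i-1
                candidate = max(best[k - 1][j], prefix[i] - prefix[j])
                if candidate < best[k][i]:
                    best[k][i] = candidate
                    cut[k][i] = j

    # Walk the cut table backwards to recover the layer groups.
    groups: List[List[int]] = []
    i = n
    for k in range(num_workers, 0, -1):
        j = cut[k][i]
        groups.append(list(range(j, i)))
        i = j
    groups.reverse()
    return best[num_workers][n], groups


# Hypothetical per-layer costs: the last four layers have become cheap.
bottleneck, groups = balance_layers([4, 4, 1, 1, 1, 1], num_workers=3)
print(bottleneck, groups)
```

With these made-up costs, an even split by layer count (two layers per worker) leaves the first worker with a load of 8, while the rebalanced assignment gives every worker a load of 4. In dynamic training the costs keep changing, so a step like this would have to be repeated as the model evolves.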
