Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models
Yongji Wu, Wenjie Qu, Xueshen Liu, Tianyang Tao, Yifan Qiao, Zhuang Wang, Wei Bai, Yuan Tian, Jiaheng Zhang, Z. Morley Mao, Matthew Lentz, Danyang Zhuo, Ion Stoica

TL;DR
Lazarus is a system for resilient and elastic training of Mixture-of-Experts models that handles failures and preemptions to improve training efficiency and robustness.
Contribution
Lazarus introduces adaptive expert placement and flexible token dispatching to improve fault tolerance and resource utilization in MoE training, addressing limitations of prior solutions.
Findings
Outperforms existing systems by up to 5.7x under failures.
Achieves 3.4x speedup on real spot instance traces.
Effectively utilizes all nodes after failures.
Abstract
The sparsely activated Mixture-of-Experts (MoE) architecture has increasingly been adopted to further scale large language models (LLMs). However, frequent failures still pose significant challenges as training scales. The cost of even a single failure is significant: all GPUs must sit idle until the failure is resolved, and considerable training progress may be lost because training has to restart from checkpoints. This problem is exacerbated by the growing use of spot instances on public clouds for model training, which, despite offering substantial cost savings, introduce frequent preemptions, essentially failures that occur regularly throughout the training process. Existing solutions for efficient fault-tolerant training either lack elasticity or rely on building resiliency into pipeline parallelism, which cannot be applied to MoE models due to the expert parallelism strategy…
Taxonomy
Topics: Gaussian Processes and Bayesian Inference · Target Tracking and Data Fusion in Sensor Networks · Reservoir Engineering and Simulation Methods
Methods: Mixture of Experts
