Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models

Yongji Wu; Wenjie Qu; Xueshen Liu; Tianyang Tao; Yifan Qiao; Zhuang Wang; Wei Bai; Yuan Tian; Jiaheng Zhang; Z. Morley Mao; Matthew Lentz; Danyang Zhuo; Ion Stoica

arXiv:2407.04656 · cs.DC · October 27, 2025 · 2 citations

TL;DR

Lazarus is a system designed to enable resilient and elastic training of Mixture-of-Experts models, effectively handling failures and preemptions to improve training efficiency and robustness.

Contribution

Lazarus introduces adaptive expert placement and flexible token dispatching to improve fault tolerance and resource utilization in MoE training, addressing limitations of prior solutions.
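As a rough illustration of what re-placing experts and re-dispatching tokens after a node loss can look like, here is a toy Python sketch. It is not Lazarus's algorithm, which this summary does not describe: the function names `place_experts` and `dispatch`, the sequential-fill placement, and the round-robin choice among replicas are all assumptions made for the example.

```python
from collections import defaultdict


def place_experts(num_experts, nodes, slots_per_node):
    """Hypothetical placement: fill slots node by node, cycling expert ids.

    Returns {expert_id: [nodes holding a replica]}. Assumes
    slots_per_node <= num_experts, so a node never holds two
    replicas of the same expert.
    """
    if len(nodes) * slots_per_node < num_experts:
        raise ValueError("not enough slots to host every expert")
    placement = defaultdict(list)
    slot = 0
    for node in nodes:
        for _ in range(slots_per_node):
            placement[slot % num_experts].append(node)
            slot += 1
    return dict(placement)


def dispatch(token_experts, placement):
    """Hypothetical dispatch: send each token to one replica of its
    expert, rotating across that expert's replicas."""
    next_replica = defaultdict(int)
    routes = []
    for expert in token_experts:
        replicas = placement[expert]
        routes.append(replicas[next_replica[expert] % len(replicas)])
        next_replica[expert] += 1
    return routes


# Four experts on four nodes with two slots each, then node 3 is lost.
before = place_experts(4, nodes=[0, 1, 2, 3], slots_per_node=2)
after = place_experts(4, nodes=[0, 1, 2], slots_per_node=2)

# Expert id chosen by the router for each of five tokens (made-up values).
routes = dispatch([0, 0, 2, 3, 1], after)
print(before)
print(after)
print(routes)
```

In this toy setup every expert still has at least one replica after the loss, and all three surviving nodes keep hosting experts. A real system would also have to weigh expert load and the cost of moving expert state, which the sketch ignores.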

Findings

1. Outperforms existing systems by up to 5.7x under failures.
2. Achieves 3.4x speedup on real spot instance traces.
3. Effectively utilizes all nodes after failures.

Abstract

The sparsely activated Mixture-of-Experts (MoE) architecture has increasingly been adopted to further scale large language models (LLMs). However, frequent failures still pose significant challenges as training scales. The cost of even a single failure is significant: all GPUs must sit idle until the failure is resolved, and considerable training progress may be lost because training has to restart from a checkpoint. This problem is exacerbated by the growing use of spot instances on public clouds for model training, which, despite offering substantial cost savings, introduce frequent preemptions, essentially failures that occur regularly throughout the training process. Existing solutions for efficient fault-tolerant training either lack elasticity or rely on building resiliency into pipeline parallelism, which cannot be applied to MoE models due to the expert parallelism strategy…


Taxonomy

Topics: Gaussian Processes and Bayesian Inference · Target Tracking and Data Fusion in Sensor Networks · Reservoir Engineering and Simulation Methods

Methods: Mixture of Experts