SIGMA: An AI-Empowered Training Stack on Early-Life Hardware
Lei Qu, Lianhai Ren, Peng Cheng, Rui Gao, Ruizhe Wang, Tianyu Chen, Xiao Liu, Xingjian Zhang, Yeyun Gong, Yifan Xiong, Yucheng Ding, Yuting Jiang, Zhenghao Lin, Zhongxin Guo, Ziyue Yang

TL;DR
SIGMA is a comprehensive open-source training stack that enhances the reliability, stability, and efficiency of large-scale AI training on early-life hardware, addressing core challenges in system disruptions, numerical errors, and parallelism complexity.
Contribution
The paper introduces SIGMA, a novel training stack with the LUCIA TRAINING PLATFORM and FRAMEWORK, optimized for early-life AI accelerators, achieving high utilization, stability, and scalability in large-scale training.
Findings
94.45% effective cluster accelerator utilization
Trained 200B MoE model with 2,048 accelerators
Only one stability incident over 75 days
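This summary does not define how the utilization figure above is measured. As a rough illustration only, one plausible reading (an assumption, not the paper's stated definition) is accelerator-hours spent on productive training divided by accelerator-hours available. The function name and the sample inputs below are hypothetical.

```python
def effective_utilization(productive_hours: float, available_hours: float) -> float:
    """Return the fraction of available accelerator-hours spent on productive work.

    Assumed definition for illustration; the paper may measure this differently.
    """
    if available_hours <= 0:
        raise ValueError("available_hours must be positive")
    if not 0 <= productive_hours <= available_hours:
        raise ValueError("productive_hours must lie in [0, available_hours]")
    return productive_hours / available_hours


if __name__ == "__main__":
    # Hypothetical inputs: 1,000 accelerator-hours available, 900 of them productive.
    ratio = effective_utilization(productive_hours=900, available_hours=1000)
    print(f"effective utilization: {ratio:.2%}")
```

Under this reading, time lost to failures, restarts, or idle accelerators lowers the numerator while the denominator stays fixed, which is why reliability work shows up directly in the metric.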
Abstract
An increasing variety of AI accelerators is being considered for large-scale training. However, enabling large-scale training on early-life AI accelerators faces three core challenges: frequent system disruptions and undefined failure modes that undermine reliability; numerical errors and training instabilities that threaten correctness and convergence; and the complexity of parallelism optimization combined with unpredictable local noise that degrades efficiency. To address these challenges, the paper presents SIGMA, an open-source training stack designed to improve the reliability, stability, and efficiency of large-scale distributed training on early-life AI hardware. At the core of this initiative is the LUCIA TRAINING PLATFORM (LTP), a system optimized for clusters with early-life AI accelerators. Since its launch in March 2025, LTP has significantly enhanced training reliability and operational…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Memory and Neural Computing · Advanced Neural Network Applications
