SIGMA: An AI-Empowered Training Stack on Early-Life Hardware

Lei Qu; Lianhai Ren; Peng Cheng; Rui Gao; Ruizhe Wang; Tianyu Chen; Xiao Liu; Xingjian Zhang; Yeyun Gong; Yifan Xiong; Yucheng Ding; Yuting Jiang; Zhenghao Lin; Zhongxin Guo; Ziyue Yang

arXiv:2512.13488 · cs.DC · December 16, 2025

Open Access

TL;DR

SIGMA is a comprehensive open-source training stack that enhances the reliability, stability, and efficiency of large-scale AI training on early-life hardware, addressing core challenges in system disruptions, numerical errors, and parallelism complexity.

Contribution

The paper introduces SIGMA, a training stack comprising the LUCIA TRAINING PLATFORM (LTP) and the LUCIA TRAINING FRAMEWORK, built for early-life AI accelerators to achieve high utilization, stability, and scalability in large-scale training.

Findings

01. 94.45% effective cluster accelerator utilization

02. Trained 200B MoE model with 2,048 accelerators

03. Only one stability incident over 75 days

Abstract

An increasing variety of AI accelerators is being considered for large-scale training. However, large-scale training on early-life AI accelerators faces three core challenges: frequent system disruptions and undefined failure modes that undermine reliability; numerical errors and training instabilities that threaten correctness and convergence; and the complexity of parallelism optimization combined with unpredictable local noise that degrades efficiency. To address these challenges, SIGMA provides an open-source training stack designed to improve the reliability, stability, and efficiency of large-scale distributed training on early-life AI hardware. At the core of this initiative is the LUCIA TRAINING PLATFORM (LTP), a system optimized for clusters of early-life AI accelerators. Since its launch in March 2025, LTP has significantly enhanced training reliability and operational…

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Memory and Neural Computing · Advanced Neural Network Applications