Unicron: Economizing Self-Healing LLM Training at Scale
Tao He, Xue Li, Zhibin Wang, Kun Qian, Jingbo Xu, Wenyuan Yu, Jingren Zhou

TL;DR
Unicron is a workload manager that enhances large-scale language model training by reducing failure recovery costs and improving efficiency through real-time error detection and dynamic reconfiguration.
Contribution
It introduces a novel self-healing system that optimizes failure recovery in large-scale LLM training, addressing diverse failure scenarios with cost-aware strategies.
Findings
Up to 1.9x improvement in training efficiency over state-of-the-art methods
Significant reduction in failure recovery costs
Effective real-time error detection without extra overhead
Abstract
Training large-scale language models is increasingly critical in various domains, but it is hindered by frequent failures, leading to significant time and economic costs. Current failure recovery methods in cloud-based settings inadequately address the diverse and complex scenarios that arise, focusing narrowly on erasing downtime for individual tasks without considering the overall cost impact on a cluster. We introduce Unicron, a workload manager designed for efficient self-healing in large-scale language model training. Unicron optimizes the training process by minimizing failure-related costs across multiple concurrent tasks within a cluster. Its key features include in-band error detection for real-time error identification without extra overhead, a dynamic cost-aware plan generation mechanism for optimal reconfiguration, and an efficient transition strategy to reduce downtime…
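As a rough illustration of what cost-aware reconfiguration across concurrent tasks could look like, the Python sketch below picks a worker allocation after a failure by comparing estimated useful work over a planning horizon. It is not Unicron's actual algorithm: the function name best_plan, the throughput tables, the task weights, and the fixed per-task transition cost are all hypothetical, and the search is a brute-force enumeration suitable only for toy inputs.

```python
from itertools import product


def best_plan(tasks, current, available, horizon_s, transition_cost_s):
    """Return (allocation, score) maximizing estimated weighted useful work.

    tasks:   {name: (weight, {workers: throughput})}  -- hypothetical values
    current: {name: workers assigned before the failure}
    available: healthy workers left in the cluster
    """
    names = list(tasks)
    # Each task may only run at the worker counts listed in its table.
    options = [sorted(tasks[name][1]) for name in names]
    best, best_score = None, float("-inf")
    for alloc in product(*options):
        if sum(alloc) > available:
            continue  # plan needs more workers than are healthy
        score = 0.0
        for name, workers in zip(names, alloc):
            weight, throughput = tasks[name]
            # Assumption: a task whose allocation changes loses a fixed
            # transition_cost_s of useful time; unchanged tasks lose none.
            changed = workers != current[name]
            useful_s = horizon_s - (transition_cost_s if changed else 0)
            score += weight * throughput[workers] * useful_s
        if score > best_score:
            best, best_score = dict(zip(names, alloc)), score
    return best, best_score


# Two tasks on 8 workers each; a failure leaves 12 healthy workers.
tasks = {
    "A": (1.0, {0: 0.0, 4: 4.0, 8: 7.0}),
    "B": (1.0, {0: 0.0, 4: 3.0, 8: 6.5}),
}
current = {"A": 8, "B": 8}
plan, score = best_plan(tasks, current, available=12,
                        horizon_s=3600, transition_cost_s=60)
print(plan, score)  # {'A': 4, 'B': 8} 37560.0
```

In this toy input, shrinking task A rather than task B gives the higher cluster-wide score, which is the kind of decision that a view limited to a single task's downtime would not surface.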
Taxonomy
Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Natural Language Processing Techniques
