Unicron: Economizing Self-Healing LLM Training at Scale

Tao He; Xue Li; Zhibin Wang; Kun Qian; Jingbo Xu; Wenyuan Yu; Jingren Zhou

arXiv:2401.00134 · cs.DC · January 8, 2024 · 5 cites

Open Access

TL;DR

Unicron is a workload manager for large-scale language model training that lowers failure recovery costs and improves training efficiency through real-time error detection and dynamic, cost-aware reconfiguration.

Contribution

The paper introduces a self-healing system that optimizes failure recovery in large-scale LLM training, handling diverse failure scenarios with cost-aware strategies.

Findings

01. Up to 1.9x improvement in training efficiency over state-of-the-art methods

02. Significant reduction in failure recovery costs

03. Real-time, in-band error detection without extra overhead

Abstract

Training large-scale language models is increasingly critical in various domains, but it is hindered by frequent failures, leading to significant time and economic costs. Current failure recovery methods in cloud-based settings inadequately address the diverse and complex scenarios that arise, focusing narrowly on erasing downtime for individual tasks without considering the overall cost impact on a cluster. We introduce Unicron, a workload manager designed for efficient self-healing in large-scale language model training. Unicron optimizes the training process by minimizing failure-related costs across multiple concurrent tasks within a cluster. Its key features include in-band error detection for real-time error identification without extra overhead, a dynamic cost-aware plan generation mechanism for optimal reconfiguration, and an efficient transition strategy to reduce downtime…

Taxonomy

Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Natural Language Processing Techniques