Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems
Ning Lu, Qian Xie, Hao Zhang, Wenyi Fang, Yang Zheng, Zheng Hu, Jiantao Ma

TL;DR
This paper introduces the Training Overhead Ratio (TOR), a new metric to evaluate the reliability of large language model training systems, helping estimate actual training time amidst failures.
Contribution
The paper proposes the TOR metric, defines it mathematically, and analyzes its effectiveness for assessing fault-tolerant LLM training system reliability.
Findings
TOR effectively quantifies training system reliability.
Key factors influencing reliability are identified.
TOR equations for different failure types are presented.
Abstract
Large Language Models (LLMs) are revolutionizing the AI industry with their superior capabilities. Training these models requires large-scale GPU clusters and significant computing time, leading to frequent failures that significantly increase training costs. Despite its significance, this field lacks a metric for evaluating reliability. In this work, we introduce a novel reliability metric called Training Overhead Ratio (TOR) to evaluate the reliability of fault-tolerant LLM training systems. TOR is defined as the ratio of optimal training time to the observed training time of a system, serving as a practical tool for users to estimate the actual time required to train an LLM on a given system. Furthermore, our investigation identifies the key factor for enhancing reliability and presents TOR equations for various types of failures encountered in practice.
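The ratio described in the abstract can be illustrated with a minimal Python sketch. It uses only the definition stated there (optimal training time divided by observed training time). The function names, the choice of hours as the unit, the sample figures, and the check that TOR lies in (0, 1] are assumptions made for illustration and are not taken from the paper.

```python
def training_overhead_ratio(optimal_hours: float, observed_hours: float) -> float:
    """TOR = optimal training time / observed training time.

    Assumption: both durations are positive and use the same unit.
    """
    if optimal_hours <= 0 or observed_hours <= 0:
        raise ValueError("training times must be positive")
    return optimal_hours / observed_hours


def estimated_training_hours(optimal_hours: float, tor: float) -> float:
    """Estimate the actual training time on a system with a known TOR.

    Assumption: a failure-free run is the lower bound on training time,
    so a meaningful TOR falls in the interval (0, 1].
    """
    if not 0 < tor <= 1:
        raise ValueError("TOR is expected to lie in (0, 1]")
    return optimal_hours / tor


# Hypothetical figures: a job that would take 800 hours without failures
# actually took 1000 hours on this system.
tor = training_overhead_ratio(optimal_hours=800, observed_hours=1000)

# Using that TOR to estimate a new job whose failure-free time is 400 hours.
estimate = estimated_training_hours(optimal_hours=400, tor=tor)
print(f"TOR = {tor:.2f}, estimated time = {estimate:.0f} hours")
```

Under this reading, a TOR closer to 1 means less time lost to failures and recovery, and dividing the failure-free time by TOR gives the time a user should expect on that system.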
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Data Processing Techniques
