Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems

Ning Lu; Qian Xie; Hao Zhang; Wenyi Fang; Yang Zheng; Zheng Hu; Jiantao Ma

arXiv:2408.07482·cs.DC·October 10, 2024

TL;DR

This paper introduces the Training Overhead Ratio (TOR), a new metric to evaluate the reliability of large language model training systems, helping estimate actual training time amidst failures.

Contribution

The paper proposes the TOR metric, defines it mathematically, and analyzes its effectiveness for assessing fault-tolerant LLM training system reliability.

Findings

01. TOR effectively quantifies training system reliability.

02. Key factors influencing reliability are identified.

03. TOR equations for different failure types are presented.

Abstract

Large Language Models (LLMs) are revolutionizing the AI industry with their superior capabilities. Training these models requires large-scale GPU clusters and significant computing time, leading to frequent failures that significantly increase training costs. Despite its significance, this field lacks a metric for evaluating reliability. In this work, we introduce a novel reliability metric called Training Overhead Ratio (TOR) to evaluate the reliability of fault-tolerant LLM training systems. TOR is defined as the ratio of optimal training time to the observed training time of a system, serving as a practical tool for users to estimate the actual time required to train an LLM on a given system. Furthermore, our investigation identifies the key factor for enhancing reliability and presents TOR equations for various types of failures encountered in practice.
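The definition in the abstract can be illustrated with a minimal Python sketch. It only encodes the stated ratio (optimal training time divided by observed training time) and its inversion to estimate actual training time. The function names, the choice of hours as the unit, and the numbers in the usage lines are hypothetical and do not come from the paper; the failure-type-specific TOR equations mentioned in the abstract are not reproduced here.

```python
def training_overhead_ratio(optimal_hours: float, observed_hours: float) -> float:
    """TOR as described in the abstract: optimal time / observed time.

    Assumption: both times use the same unit and are positive. Since the
    observed time is not expected to beat the optimal time, the result
    should normally fall in (0, 1], with 1 meaning no overhead.
    """
    if optimal_hours <= 0 or observed_hours <= 0:
        raise ValueError("training times must be positive")
    return optimal_hours / observed_hours


def estimate_actual_time(optimal_hours: float, tor: float) -> float:
    """Invert the ratio: expected actual time = optimal time / TOR."""
    if tor <= 0:
        raise ValueError("TOR must be positive")
    return optimal_hours / tor


# Hypothetical numbers for illustration only (not taken from the paper).
tor = training_overhead_ratio(optimal_hours=80.0, observed_hours=100.0)
estimated = estimate_actual_time(optimal_hours=80.0, tor=tor)
print(f"TOR = {tor:.2f}, estimated actual time = {estimated:.1f} h")
```

Under this reading, a system with a TOR closer to 1 loses less time to failures and recovery, so a job's expected wall-clock time can be estimated by dividing its failure-free time by the system's TOR.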

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Data Processing Techniques