Understanding Stragglers in Large Model Training Using What-if Analysis

Jinkun Lin; Ziheng Jiang; Zuquan Song; Sida Zhao; Menghan Yu; Zhanghan Wang; Chenyuan Wang; Zuocheng Shi; Xiang Shi; Wei Jia; Zherui Liu; Shuguang Wang; Haibin Lin; Xin Liu; Aurojit Panda; Jinyang Li

arXiv:2505.05713·cs.DC·May 13, 2025

TL;DR

This paper investigates the causes and patterns of stragglers in large language model training, using a comprehensive five-month trace and what-if analysis to understand their impact and origins.

Contribution

It introduces a detailed study of stragglers in LLM training with a novel what-if analysis approach based on real trace data.

Findings

01 Stragglers significantly impact training performance.

02 Stragglers exhibit both temporal and spatial patterns.

03 Root causes include complex factors beyond hardware failures.

Abstract

Large language model (LLM) training is one of the most demanding distributed computations today, often requiring thousands of GPUs with frequent synchronization across machines. Such a workload pattern makes it susceptible to stragglers, where training can be stalled by a few slow workers. At ByteDance, we find that stragglers are not always trivially caused by hardware failures, but can arise from multiple complex factors. This work aims to present a comprehensive study of straggler issues in LLM training, using a five-month trace collected from our ByteDance LLM training cluster. The core methodology is what-if analysis, which simulates the scenario without any stragglers and contrasts it with the actual case. We use this method to study the following questions: (1) how often do stragglers affect training jobs, and what effect do they have on job performance; (2) do stragglers exhibit…
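
The what-if idea in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's simulator. It assumes a purely synchronous job where each step ends when its slowest worker finishes, and it assumes the straggler-free scenario can be approximated by giving every worker the step's median duration. The function names and the trace layout (`step_time`, `what_if_slowdown`, a list of per-worker durations per step) are hypothetical and chosen only for illustration.

```python
from statistics import median


def step_time(durations):
    """A synchronous step finishes only when its slowest worker does."""
    return max(durations)


def what_if_slowdown(trace):
    """Compare the actual run with a hypothetical straggler-free run.

    trace: list of steps, each a list of per-worker durations in seconds.
    Assumption: in the straggler-free run every worker takes the step's
    median duration, so the step itself takes the median.
    Returns (actual_total, ideal_total, slowdown).
    """
    actual = sum(step_time(step) for step in trace)
    ideal = sum(median(step) for step in trace)
    return actual, ideal, actual / ideal


# Hypothetical trace: 3 steps, 4 workers, one slow worker in steps 2 and 3.
trace = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 3.0, 1.0],
    [1.0, 2.0, 1.0, 1.0],
]
actual, ideal, slowdown = what_if_slowdown(trace)
print(f"actual={actual:.1f}s ideal={ideal:.1f}s slowdown={slowdown:.2f}x")
```

A slowdown near 1 would suggest stragglers cost the job little, while larger values indicate time lost waiting on slow workers. A real analysis would also need to model the dependencies between compute and communication operations, which this sketch ignores.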

Code & Models

Repositories

bytedance-seed/straggleranalysis (Official)

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Cloud Computing and Resource Management