BootSeer: Analyzing and Mitigating Initialization Bottlenecks in Large-Scale LLM Training
Rui Li, Xiaoyun Zhi, Jinxin Chi, Menghan Yu, Lixin Huang, Jia Zhu, Weilun Zhang, Xing Ma, Wenjia Liu, Zhicheng Zhu, Daowen Luo, Zuquan Song, Xin Yin, Chao Xiang, Shuguang Wang, Wencong Xiao, Gene Cooperman

TL;DR
This paper investigates startup overhead in large-scale LLM training, characterizes its components, and introduces BootSeer, a system that cuts startup overhead by 50% on real training workloads in an industrial setting.
Contribution
The paper provides what the authors describe as the first in-depth characterization of startup overhead in LLM training, and proposes BootSeer, a system whose techniques reduce startup time in production environments.
Findings
Startup overhead accounts for more than 3.5% of GPU time in one of the authors' training clusters.
BootSeer reduces startup overhead by 50% in real workloads.
Analysis of startup components informs targeted optimizations.
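
To put the two headline figures above in concrete terms, here is a minimal arithmetic sketch. The cluster size of 1,000,000 GPU-hours and the function name `gpu_hours_saved` are hypothetical and chosen only for illustration; the 3.5% and 50% figures are the ones reported in the findings.

```python
def gpu_hours_saved(total_gpu_hours: float,
                    startup_fraction: float,
                    reduction: float) -> float:
    """Estimate GPU-hours recovered when startup overhead shrinks.

    total_gpu_hours  -- GPU-hours consumed by the cluster (hypothetical input)
    startup_fraction -- share of GPU time spent on startup (e.g. 0.035)
    reduction        -- fractional cut in startup overhead (e.g. 0.5)
    """
    if total_gpu_hours < 0:
        raise ValueError("total_gpu_hours must be non-negative")
    if not 0.0 <= startup_fraction <= 1.0:
        raise ValueError("startup_fraction must be in [0, 1]")
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be in [0, 1]")

    # GPU-hours currently lost to startup.
    startup_hours = total_gpu_hours * startup_fraction
    # Portion of those hours recovered by the reduction.
    return startup_hours * reduction


if __name__ == "__main__":
    # Hypothetical cluster: 1,000,000 GPU-hours, 3.5% startup, 50% reduction.
    saved = gpu_hours_saved(1_000_000, 0.035, 0.5)
    print(f"Estimated GPU-hours recovered: {saved:,.0f}")
```

Under these assumptions, halving a 3.5% overhead recovers roughly 1.75% of total GPU time. The 3.5% is stated as a lower bound, so the actual figure may be higher.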
Abstract
Large Language Models (LLMs) have become a cornerstone of modern AI, driving breakthroughs in natural language processing and expanding into multimodal jobs involving images, audio, and video. As with most computational software, it is important to distinguish between ordinary runtime performance and startup overhead. Prior research has focused on runtime performance: improving training efficiency and stability. This work focuses instead on the increasingly critical issue of startup overhead in training: the delay before training jobs begin execution. Startup overhead is particularly important in large, industrial-scale LLMs, where failures occur more frequently and multiple teams operate in iterative update-debug cycles. In one of our training clusters, more than 3.5% of GPU time is wasted due to startup overhead alone. In this work, we present the first in-depth characterization of…
Taxonomy
Topics: Reservoir Engineering and Simulation Methods · Scientific Computing and Data Management · Simulation Techniques and Applications
