Towards Understanding Bugs in Distributed Training and Inference Frameworks for Large Language Models
Xiao Yu, Haoxuan Chen, Feifei Niu, Xing Hu, Jacky Wai Keung, Xin Xia

TL;DR
This paper provides a large-scale empirical analysis of 308 fixed bugs in three distributed training and inference frameworks for large language models (DeepSpeed, Megatron-LM, and Colossal-AI), revealing common causes, fixing strategies, and opportunities for automation to improve framework reliability.
Contribution
The paper presents the first large-scale empirical study of bug symptoms, root causes, and identification and fixing efforts in popular distributed training and inference frameworks for LLMs, offering insights for better debugging and automation.
Findings
48% of bug fixes require minimal code changes (<=10 LOC)
Common bug root causes include allocation and communication errors
Simple fixing strategies often involve conditional logic and parameter handling improvements
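As an illustration of what a low-effort fix based on conditional logic and parameter handling might look like, here is a minimal Python sketch. It is not taken from the paper or from DeepSpeed, Megatron-LM, or Colossal-AI; the function names `shard_size_buggy` and `shard_size_fixed` and the sharding scenario are hypothetical and chosen only to show the shape of such a fix.

```python
from typing import Optional


def shard_size_buggy(num_params: int, world_size: int) -> int:
    # Hypothetical pre-fix version: floor division silently drops the
    # remainder when num_params is not divisible by world_size, and
    # world_size == 0 raises ZeroDivisionError.
    return num_params // world_size


def shard_size_fixed(num_params: int, world_size: Optional[int] = None) -> int:
    # Parameter handling: fall back to a single process when world_size is unset.
    if world_size is None:
        world_size = 1
    # Conditional logic: reject invalid values with a clear error message.
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    # Ceiling division so every parameter is assigned to some shard.
    return (num_params + world_size - 1) // world_size
```

In this sketch the whole fix is a handful of added lines (a default value, one guard, and a changed division), which is the kind of change that would fall under the 10-LOC threshold mentioned in the findings.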
Abstract
With the rapid development of large language models (LLMs), distributed training and inference frameworks like DeepSpeed have become essential for scaling model training and inference across multiple GPUs or nodes. However, the increasing complexity of these frameworks brings non-trivial software bugs, which may degrade training performance, cause unexpected failures, and result in significant resource waste. Understanding framework bugs' characteristics is fundamental for quality assurance, allowing the design of more effective debugging and repair methods. Thus, our paper conducts the first large-scale empirical analysis of 308 fixed bugs across three popular distributed training/inference frameworks: DeepSpeed, Megatron-LM, and Colossal-AI. We examine bug symptoms, root causes, bug identification and fixing efforts, and common low-effort fixing strategies. Additionally, the…
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software System Performance and Reliability
