Towards Understanding Bugs in Distributed Training and Inference Frameworks for Large Language Models

Xiao Yu; Haoxuan Chen; Feifei Niu; Xing Hu; Jacky Wai Keung; Xin Xia

arXiv:2506.10426·cs.SE·June 13, 2025

Open Access

TL;DR

This paper presents a large-scale empirical analysis of 308 fixed bugs in three distributed training and inference frameworks for large language models (DeepSpeed, Megatron-LM, and Colossal-AI). It identifies common root causes and fixing strategies, and points to opportunities for automation that could improve framework reliability.

Contribution

The paper reports the first large-scale empirical study of bug characteristics, root causes, and fixing effort in popular distributed training and inference frameworks for LLMs, offering insights for better debugging and automation.

Findings

01. 48% of bug fixes require minimal code changes (<=10 LOC).

02. Common bug root causes include allocation and communication errors.

03. Simple fixing strategies often involve improvements to conditional logic and parameter handling.

Abstract

With the rapid development of large language models (LLMs), distributed training and inference frameworks like DeepSpeed have become essential for scaling model training and inference across multiple GPUs or nodes. However, the increasing complexity of these frameworks brings non-trivial software bugs, which may degrade training performance, cause unexpected failures, and result in significant resource waste. Understanding framework bugs' characteristics is fundamental for quality assurance, allowing the design of more effective debugging and repair methods. Thus, our paper conducts the first large-scale empirical analysis of 308 fixed bugs across three popular distributed training/inference frameworks: DeepSpeed, Megatron-LM, and Colossal-AI. We examine bug symptoms, root causes, bug identification and fixing efforts, and common low-effort fixing strategies. Additionally, the…

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software System Performance and Reliability