A Comprehensive Study of Bugs in Modern Distributed Deep Learning Systems

Xiaoxue Ma; Wanwei Zhan; Jiale Chen; Yishu Li; Jacky Keung; Federica Sarro

arXiv:2512.20345·cs.SE·December 24, 2025

Open Access

TL;DR

This paper provides a large-scale empirical analysis of bugs in distributed deep learning systems, identifying common issues, root causes, and fixes to improve reliability and developer understanding.

Contribution

It presents the first comprehensive taxonomy of bugs in distributed deep learning frameworks based on analysis of 849 real-world issues, mapping symptoms to causes and solutions.

Findings

01. 45.1% of bug symptoms are unique to distributed frameworks.

02. Over 60% of issues are resolved through version, dependency, and tuning adjustments.

03. 95% of communication setup issues occur exclusively in distributed contexts.

Abstract

In today's data-driven era, deep learning is vital for processing massive datasets, yet single-device training is constrained by computational and memory limits. Distributed deep learning overcomes these challenges by leveraging multiple GPUs or machines in parallel. While general-purpose frameworks (e.g., TensorFlow and PyTorch) provide distributed capabilities, these are often add-on features that demand significant manual effort for advanced parallelism, underscoring the need for specialized frameworks. This study conducts the first large-scale empirical analysis of practitioner challenges in dedicated distributed frameworks. We examine 849 real-world issues from DeepSpeed, Megatron-LM, and Colossal-AI and construct a taxonomy of 34 bug symptoms, 28 root causes, and 6 fix patterns. Crucially, we establish explicit mappings between symptoms, causes, and fixes across distributed…

Taxonomy

Topics: Software System Performance and Reliability · Software Testing and Debugging Techniques · Adversarial Robustness in Machine Learning