A Comprehensive Study of Bugs in Modern Distributed Deep Learning Systems
Xiaoxue Ma, Wanwei Zhan, Jiale Chen, Yishu Li, Jacky Keung, Federica Sarro

TL;DR
This paper provides a large-scale empirical analysis of bugs in distributed deep learning systems, identifying common issues, root causes, and fixes to improve reliability and developer understanding.
Contribution
The paper presents the first comprehensive taxonomy of bugs in distributed deep learning frameworks, based on an analysis of 849 real-world issues, and maps symptoms to root causes and fixes.
Findings
45.1% of bug symptoms are unique to distributed frameworks
Over 60% of issues are resolved through version, dependency, and tuning adjustments
95% of communication setup issues occur exclusively in distributed contexts
Abstract
In today's data-driven era, deep learning is vital for processing massive datasets, yet single-device training is constrained by computational and memory limits. Distributed deep learning overcomes these challenges by leveraging multiple GPUs or machines in parallel. While general-purpose frameworks (e.g., TensorFlow and PyTorch) provide distributed capabilities, these are often add-on features that demand significant manual effort for advanced parallelism, underscoring the need for specialized frameworks. This study conducts the first large-scale empirical analysis of practitioner challenges in dedicated distributed frameworks. We examine 849 real-world issues from DeepSpeed, Megatron-LM, and Colossal-AI and construct a taxonomy of 34 bug symptoms, 28 root causes, and 6 fix patterns. Crucially, we establish explicit mappings between symptoms, causes, and fixes across distributed…
Taxonomy
Topics: Software System Performance and Reliability · Software Testing and Debugging Techniques · Adversarial Robustness in Machine Learning
