EROICA: Online Performance Troubleshooting for Large-scale Model Training
Yu Guan, Zhiyu Yin, Haoyu Chen, Sheng Cheng, Chaojie Yang, Kun Qian, Tianyin Xu, Pengcheng Zhang, Yang Zhang, Hanyu Zhao, Yong Li, Wei Lin, Dennis Cai, Ennan Zhai

TL;DR
EROICA is an online troubleshooting system for large-scale GPU clusters that diagnoses hardware and software performance issues in large model training, with high accuracy and minimal impact.
Contribution
EROICA is presented as the first online troubleshooting system for large model training that combines fine-grained, profiling-based observation with coverage of all machines in a GPU cluster, using differential observability to localize root causes.
Findings
Achieved a 97.5% success rate in diagnosing performance issues.
Deployed on clusters with approximately 100,000 GPUs for over 1.5 years.
Effectively identifies both hardware and software problems in production environments.
Abstract
Troubleshooting performance problems of large model training (LMT) is immensely challenging, due to unprecedented scales of modern GPU clusters, the complexity of software-hardware interactions, and the data intensity of the training process. Existing troubleshooting approaches designed for traditional distributed systems or datacenter networks fall short and can hardly apply to real-world training systems. In this paper, we present EROICA, the first online troubleshooting system that provides both fine-grained observation based on profiling, and coverage of all machines in GPU clusters, to diagnose performance issues in production, including both hardware and software problems (or the mixture of both). EROICA effectively summarizes runtime behavior patterns of LMT function executions via online profiling, and leverages differential observability to localize the root cause with minimal…
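As a rough illustration of the idea of comparing behavior across machines, the sketch below flags machines whose profiled duration for one function deviates strongly from their peers. It is not EROICA's algorithm: the function name `flag_outlier_machines`, the median/MAD scoring, the threshold, and the sample data are all assumptions made for this example.

```python
from statistics import median


def flag_outlier_machines(durations_ms, threshold=3.0):
    """Return the machines whose duration deviates strongly from the peer median.

    durations_ms maps a machine name to the mean duration (in ms) of one
    profiled function on that machine. All names here are hypothetical.
    """
    values = list(durations_ms.values())
    med = median(values)
    # Median absolute deviation: a robust estimate of the normal spread.
    mad = median(abs(v - med) for v in values)
    # Avoid division by zero when most machines report identical durations.
    scale = mad if mad > 0 else 1e-9
    return sorted(
        machine
        for machine, value in durations_ms.items()
        if abs(value - med) / scale > threshold
    )


# Hypothetical per-machine durations for a single training function.
sample = {
    "node-0": 10.1,
    "node-1": 9.9,
    "node-2": 10.0,
    "node-3": 10.2,
    "node-4": 25.0,
}
outliers = flag_outlier_machines(sample)
print(outliers)
```

The median and MAD are used here instead of the mean and standard deviation so that a single slow machine does not distort the baseline it is compared against.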
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Software System Performance and Reliability · Advanced Neural Network Applications
