TTrace: Lightweight Error Checking and Diagnosis for Distributed Training
Haitian Jiang, Shaowei Zhu, Zhen Zhang, Zhenyu Song, Xinwei Fu, Zhen Jia, Yida Wang, Jinyang Li

TL;DR
TTrace is a systematic differential testing system that detects and localizes silent bugs in distributed neural network training by aligning tensors with a trusted reference, improving debugging efficiency.
Contribution
The paper introduces TTrace, the first systematic differential testing system for detecting and localizing silent bugs in distributed training, together with a novel mathematical analysis for setting tensor comparison tolerances.
Findings
Detects 14 bugs in the Megatron-LM framework
Requires fewer than 10 lines of code changes
Works across various training recipes, including low-precision ones
Abstract
Distributed training is essential for scaling the training of large neural network models, such as large language models (LLMs), across thousands of GPUs. However, the complexity of distributed training programs makes them particularly prone to silent bugs, which do not produce explicit error signals but lead to incorrect training outcomes. Effectively detecting and localizing such silent bugs in distributed training is challenging. Common debugging practices based on monitoring training loss or gradient norm curves are indirect, inefficient, and provide no way to localize bugs. To address those challenges, we design and implement TTrace, the first systematic differential testing system for detecting and localizing silent bugs in distributed training. TTrace aligns intermediate tensors from distributed training with those from a trusted reference implementation. To properly compare the…
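The idea of aligning intermediate tensors against a trusted reference can be illustrated with a small sketch. This is not TTrace's API or algorithm. The function names (`rel_error`, `first_divergence`), the trace format of ordered `(name, values)` pairs, and the fixed tolerance `tol=1e-3` are all assumptions made for illustration. In particular, a single fixed threshold stands in for the tolerance analysis the paper describes, which this sketch does not attempt to reproduce.

```python
import math

def rel_error(test, ref):
    """Relative L2 error between two flat lists of floats."""
    diff = math.sqrt(sum((t - r) ** 2 for t, r in zip(test, ref)))
    norm = math.sqrt(sum(r * r for r in ref))
    # Fall back to the absolute error when the reference is all zeros.
    return diff / norm if norm > 0 else diff

def first_divergence(ref_trace, test_trace, tol=1e-3):
    """Return (name, error) for the first tensor whose error exceeds tol.

    Both traces are lists of (name, values) pairs in execution order.
    tol is a hypothetical fixed threshold, not a derived bound.
    Returns None when every compared tensor is within tolerance.
    """
    for (name, ref), (test_name, test) in zip(ref_trace, test_trace):
        assert name == test_name, "traces must record the same tensors in order"
        err = rel_error(test, ref)
        if err > tol:
            return name, err
    return None

# Toy traces: the candidate run diverges at "attn_out", and the error
# then propagates to "mlp_out". Only the earliest divergence is reported.
reference = [
    ("embed", [1.0, 2.0, 3.0]),
    ("attn_out", [0.5, -0.5, 0.25]),
    ("mlp_out", [2.0, 0.0, -1.0]),
]
candidate = [
    ("embed", [1.0, 2.0, 3.0]),
    ("attn_out", [0.5, -0.5, 0.75]),
    ("mlp_out", [2.0, 0.5, -1.0]),
]

result = first_divergence(reference, candidate)
print(result)
```

Reporting the earliest mismatching tensor, rather than every mismatch, is what turns detection into localization: later tensors usually differ only because they consume the first corrupted value.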
Taxonomy
Topics: Software Testing and Debugging Techniques · Adversarial Robustness in Machine Learning · Machine Learning and Data Classification
