Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks
Yuxuan Jiang, Ziming Zhou, Boyu Xu, Beijie Liu, Runhui Xu, Peng Huang

TL;DR
This paper introduces TRAINCHECK, a proactive framework that automatically infers invariants tailored to deep learning training and checks them during training to detect silent errors and provide debugging help.
Contribution
The paper presents TRAINCHECK, a novel invariant inference-based framework that proactively detects silent errors in deep learning training, including errors caused by previously unknown bugs in training libraries.
Findings
Detects 18 of the 20 reproduced real-world silent training errors
Detects each of those 18 errors within a single training iteration
Uncovers 6 previously unknown bugs in popular training libraries that lead to silent errors
Abstract
Training deep learning (DL) models is a complex process, making it prone to silent errors that are challenging to detect and diagnose. This paper presents TRAINCHECK, a framework that takes a proactive checking approach to address silent training errors. TRAINCHECK automatically infers invariants tailored for DL training. It uses these invariants to proactively detect silent errors during the training process while providing debugging help. To evaluate TRAINCHECK, we reproduce 20 real-world silent training errors with diverse root causes. TRAINCHECK successfully detects 18 errors within a single training iteration. It also uncovers 6 unknown bugs in popular training libraries that lead to silent errors.
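The idea of inferring invariants and then checking them during training can be illustrated with a toy example. The sketch below is not TRAINCHECK's API or its inference algorithm; every name in it (`record_trace`, `infer_invariants`, `first_violation`, the hand-written optimizer steps) is hypothetical and chosen only for illustration. It assumes a single, very simple candidate invariant ("a parameter with a nonzero gradient changes after the optimizer step"), keeps the candidates that hold on a known-good run, and checks them against a run with a hypothetical silent bug.

```python
# Illustrative sketch only: NOT TrainCheck's API or algorithm.
# All names here are hypothetical and exist only for this example.
from typing import Callable, Dict, List, Optional, Tuple

Params = Dict[str, float]
Record = Dict[str, Params]
Check = Callable[[Record], bool]


def sgd_step(params: Params, grads: Params, lr: float) -> Params:
    """Reference optimizer step: every parameter moves against its gradient."""
    return {name: value - lr * grads[name] for name, value in params.items()}


def buggy_sgd_step(params: Params, grads: Params, lr: float) -> Params:
    """Hypothetical silent bug: 'bias' is never updated, yet nothing crashes."""
    return {
        name: value if name == "bias" else value - lr * grads[name]
        for name, value in params.items()
    }


def record_trace(step: Callable[[Params, Params, float], Params],
                 iterations: int) -> List[Record]:
    """Run a toy training loop and record parameters around each step."""
    params = {"weight": 1.0, "bias": 0.5}
    trace = []
    for _ in range(iterations):
        # Gradient of the toy loss sum(p ** 2) with respect to each parameter.
        grads = {name: 2.0 * value for name, value in params.items()}
        after = step(params, grads, 0.1)
        trace.append({"before": params, "grads": grads, "after": after})
        params = after
    return trace


def changed_when_grad_nonzero(name: str) -> Check:
    """Candidate invariant: a nonzero gradient implies the parameter changes."""
    def check(record: Record) -> bool:
        if record["grads"][name] == 0.0:
            return True
        return record["after"][name] != record["before"][name]
    return check


def infer_invariants(trace: List[Record]) -> Dict[str, Check]:
    """Keep the candidates that hold on every record of a known-good trace."""
    candidates = {
        f"{name} changes after step": changed_when_grad_nonzero(name)
        for name in trace[0]["before"]
    }
    return {
        label: check
        for label, check in candidates.items()
        if all(check(record) for record in trace)
    }


def first_violation(invariants: Dict[str, Check],
                    trace: List[Record]) -> Optional[Tuple[int, str]]:
    """Return (iteration, invariant label) of the first violation, if any."""
    for iteration, record in enumerate(trace):
        for label, check in invariants.items():
            if not check(record):
                return iteration, label
    return None


invariants = infer_invariants(record_trace(sgd_step, iterations=5))
violation = first_violation(invariants, record_trace(buggy_sgd_step, iterations=5))
print(violation)
```

In this toy setup the buggy run finishes without raising any exception, and only the inferred check flags the stale parameter, together with the iteration and the violated relation as a debugging hint. A real system would need a far richer set of candidate relations and instrumentation of an actual training framework.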
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Software Reliability and Analysis Research · Intelligent Tutoring Systems and Adaptive Learning
