Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks

Yuxuan Jiang; Ziming Zhou; Boyu Xu; Beijie Liu; Runhui Xu; Peng Huang

arXiv:2506.14813 · cs.LG · June 19, 2025

Open Access

TL;DR

This paper introduces TRAINCHECK, a framework that automatically infers invariants tailored to deep learning training and checks them during training to proactively detect silent errors and help with debugging.

Contribution

The paper presents TRAINCHECK, an invariant-inference-based framework that proactively detects silent errors during deep learning training, including errors caused by previously unknown bugs in training libraries.

Findings

01 Successfully detects 18 out of 20 real-world silent errors

02 Uncovers 6 previously unknown bugs in training libraries

03 Detects errors within a single training iteration

Abstract

Training deep learning (DL) models is a complex process, making it prone to silent errors that are challenging to detect and diagnose. This paper presents TRAINCHECK, a framework that takes a proactive checking approach to address silent training errors. TRAINCHECK automatically infers invariants tailored for DL training. It uses these invariants to proactively detect silent errors during the training process while providing debugging help. To evaluate TRAINCHECK, we reproduce 20 real-world silent training errors with diverse root causes. TRAINCHECK successfully detects 18 errors within a single training iteration. It also uncovers 6 unknown bugs in popular training libraries that lead to silent errors.
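The abstract's idea of inferring invariants and then checking them proactively can be illustrated with a toy sketch. This is not TRAINCHECK's API or its actual invariant set: the per-iteration record fields (`w_before`, `w_after`, `grad`, `loss`), the candidate predicates, and the helper names are all hypothetical, chosen only to show the general pattern of keeping the relations that hold on known-good runs and flagging the first iteration where a new run breaks one.

```python
from typing import Callable, Dict, List, Optional, Tuple

# One record per training iteration; the field names are hypothetical.
Step = Dict[str, float]
Predicate = Callable[[Step], bool]

# Candidate relations to test against known-good runs (illustrative only).
CANDIDATES: Dict[str, Predicate] = {
    "weights_change_after_step": lambda s: s["w_after"] != s["w_before"],
    # NaN is the only float not equal to itself, so this rejects NaN and inf.
    "grad_is_finite": lambda s: s["grad"] == s["grad"]
    and abs(s["grad"]) != float("inf"),
    "loss_below_one": lambda s: s["loss"] < 1.0,
}


def infer_invariants(good_traces: List[List[Step]]) -> Dict[str, Predicate]:
    """Keep only the candidates that hold on every step of every good trace."""
    return {
        name: pred
        for name, pred in CANDIDATES.items()
        if all(pred(step) for trace in good_traces for step in trace)
    }


def first_violation(
    invariants: Dict[str, Predicate], trace: List[Step]
) -> Optional[Tuple[int, str]]:
    """Return (iteration index, invariant name) of the first violation, if any."""
    for i, step in enumerate(trace):
        for name, pred in invariants.items():
            if not pred(step):
                return (i, name)
    return None


good_run = [
    {"w_before": 0.50, "w_after": 0.45, "grad": 0.5, "loss": 1.20},
    {"w_before": 0.45, "w_after": 0.41, "grad": 0.4, "loss": 0.90},
]
# Simulated silent error: the optimizer step leaves the weight unchanged.
buggy_run = [
    {"w_before": 0.50, "w_after": 0.50, "grad": 0.5, "loss": 1.20},
]

invariants = infer_invariants([good_run])
violation = first_violation(invariants, buggy_run)
print(sorted(invariants), violation)
```

In this sketch the `loss_below_one` candidate is discarded because the good run itself violates it, while the unchanged weight in the simulated buggy run is flagged at its first iteration. A real system would record such observations by instrumenting the training program rather than from hand-written dictionaries.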

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Software Reliability and Analysis Research · Intelligent Tutoring Systems and Adaptive Learning