ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation
Swapnil Gandhi, Mark Zhao, Athinagoras Skiadopoulos, Christos Kozyrakis

TL;DR
ReCycle is a system that enhances large DNN training resilience by dynamically rerouting micro-batches during failures, maintaining high throughput without spare servers through pipeline schedule optimizations.
Contribution
ReCycle introduces a novel failure-tolerance method for large-scale DNN training that leverages inherent redundancy and pipeline adjustments to sustain throughput.
Findings
ReCycle outperforms recent fault-tolerance methods by up to 1.64x.
It maintains high training throughput under multiple failures.
The system effectively minimizes throughput degradation during failures.
Abstract
Training large Deep Neural Network (DNN) models requires thousands of GPUs over the course of several days or weeks. At this scale, failures are frequent and can significantly reduce training throughput. Utilizing spare GPU servers to mitigate performance loss becomes increasingly costly as model sizes grow. ReCycle is a system designed for efficient DNN training in the presence of failures, without relying on spare servers. It exploits the inherent functional redundancy in distributed training systems -- where servers across data-parallel groups store the same model parameters -- and pipeline schedule bubbles within each data-parallel group. When servers fail, ReCycle dynamically re-routes micro-batches to data-parallel peers, allowing for uninterrupted training despite multiple failures. However, this re-routing can create imbalances across pipeline stages, leading to reduced…
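The re-routing idea in the abstract can be illustrated with a toy sketch. This is not ReCycle's actual scheduler: the function name `reroute_micro_batches`, its parameters, and the round-robin policy are all assumptions made for illustration. It models only the observation that workers holding the same pipeline stage in different data-parallel groups store the same parameters, so a failed worker's micro-batches can be handed to its live peers. It does not model pipeline bubbles or the schedule optimizations the paper relies on to recover throughput.

```python
from collections import defaultdict


def reroute_micro_batches(num_pipelines, num_stages, micro_batches, failed):
    """Toy re-routing plan (hypothetical helper, not ReCycle's algorithm).

    num_pipelines: number of data-parallel groups, each a full pipeline.
    num_stages:    pipeline stages per group.
    micro_batches: micro-batches each group processes per iteration.
    failed:        set of (pipeline, stage) workers that are down.

    Returns {(pipeline, stage): [(origin_pipeline, micro_batch), ...]},
    i.e. the work each live worker executes for this iteration.
    """
    plan = defaultdict(list)
    for stage in range(num_stages):
        # Peers for this stage: same stage index in every group, since
        # data-parallel replicas store identical parameters for it.
        live = [p for p in range(num_pipelines) if (p, stage) not in failed]
        if not live:
            raise RuntimeError(f"no live replica left for stage {stage}")
        next_peer = 0
        for origin in range(num_pipelines):
            for mb in range(micro_batches):
                if (origin, stage) in failed:
                    # Assumed policy: spread orphaned work round-robin.
                    target = live[next_peer % len(live)]
                    next_peer += 1
                else:
                    target = origin
                plan[(target, stage)].append((origin, mb))
    return dict(plan)


if __name__ == "__main__":
    # 3 data-parallel groups, 2 stages, 4 micro-batches; stage 0 of group 1 fails.
    plan = reroute_micro_batches(3, 2, 4, failed={(1, 0)})
    for worker in sorted(plan):
        print(worker, len(plan[worker]), "micro-batches")
```

In this example the two surviving stage-0 workers each pick up two extra micro-batches while stage-1 workers keep their original load, which is the kind of cross-stage imbalance the abstract says re-routing can create.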
