ECRM: Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding
Kaige Liu, Jack Kosaian, K. V. Rashmi

TL;DR
ECRM is an erasure-coding-based fault tolerance system for training deep-learning-based recommendation models (DLRMs). Compared to traditional checkpointing, it reduces training overhead by up to 88% and speeds up recovery by up to 10.3 times, making large-scale model training more efficient.
Contribution
ECRM is the first system to apply erasure coding for fault tolerance in DLRM training. It optimizes encoding and parity updates, and it allows training to continue during recovery.
Findings
Reduces training overhead by up to 88% compared to checkpointing
Speeds up recovery by up to 10.3 times compared to checkpointing
Enables training to continue during fault recovery
Abstract
Deep-learning-based recommendation models (DLRMs) are widely deployed to serve personalized content to users. DLRMs are large in size due to their use of large embedding tables, and are trained by distributing the model across the memory of tens or hundreds of servers. Server failures are common in such large distributed systems and must be mitigated to enable training to progress. Checkpointing is the primary approach used for fault tolerance in these systems, but incurs significant training-time overhead both during normal operation and when recovering from failures. As these overheads increase with DLRM size, checkpointing is slated to become an even larger overhead for future DLRMs, which are expected to grow in size. This calls for rethinking fault tolerance in DLRM training. We present ECRM, a DLRM training system that achieves efficient fault tolerance using erasure coding.…
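The abstract excerpt does not describe the coding scheme itself, so the following is only a minimal sketch of the general idea of erasure-coded redundancy for embedding tables. It assumes a simple sum-based parity shard over k data shards, and every name in it (`encode_parity`, `apply_update`, `recover`) is hypothetical rather than part of ECRM.

```python
from typing import List

# One shard of an embedding table: a list of rows, each a list of floats.
Shard = List[List[float]]


def encode_parity(shards: List[Shard]) -> Shard:
    """Build a parity shard as the element-wise sum of all data shards.

    Assumption: every shard has the same number of rows and columns.
    """
    rows, dim = len(shards[0]), len(shards[0][0])
    return [[sum(s[r][c] for s in shards) for c in range(dim)]
            for r in range(rows)]


def apply_update(shards: List[Shard], parity: Shard,
                 shard_id: int, row: int, delta: List[float]) -> None:
    """Apply a sparse update to one embedding row and mirror it into parity.

    Adding the same delta to the parity row keeps parity equal to the sum
    of the data shards without re-encoding from scratch.
    """
    for c, d in enumerate(delta):
        shards[shard_id][row][c] += d
        parity[row][c] += d


def recover(shards: List[Shard], parity: Shard, lost_id: int) -> Shard:
    """Rebuild one lost shard by subtracting the surviving shards from parity."""
    rows, dim = len(parity), len(parity[0])
    return [[parity[r][c] - sum(s[r][c] for i, s in enumerate(shards)
                                if i != lost_id)
             for c in range(dim)]
            for r in range(rows)]


# Small demonstration with three data shards of two rows each.
demo_shards = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, 0.5], [1.5, 2.5]],
    [[2.0, 0.0], [0.0, 1.0]],
]
demo_parity = encode_parity(demo_shards)
apply_update(demo_shards, demo_parity, shard_id=1, row=0, delta=[0.25, -0.5])
# Pretend shard 1 was lost and rebuild it from parity and the survivors.
assert recover(demo_shards, demo_parity, lost_id=1) == demo_shards[1]
```

In this sketch a single parity shard tolerates the loss of any one data shard, at the storage cost of one extra shard rather than a full copy. The demonstration uses values that are exactly representable in floating point; with arbitrary floats, recovery by subtraction would in general be exact only up to rounding error.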
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Machine Learning in Healthcare · Advanced Graph Neural Networks
