ECRM: Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding

Kaige Liu; Jack Kosaian; K. V. Rashmi

arXiv:2104.01981·cs.LG·April 6, 2021·1 cites

TL;DR

ECRM introduces an erasure coding-based fault tolerance system for deep-learning recommendation models, significantly reducing recovery time and training overhead compared to traditional checkpointing, enabling more efficient large-scale model training.

Contribution

ECRM is the first system to apply erasure coding for fault tolerance in DLRM training. It optimizes encoding and parity updates, and it allows training to continue during recovery.

Findings

01 Reduces training overhead by up to 88% compared to checkpointing

02 Speeds up recovery from failures by up to 10.3 times compared to checkpointing

03 Enables training to continue during fault recovery

Abstract

Deep-learning-based recommendation models (DLRMs) are widely deployed to serve personalized content to users. DLRMs are large in size due to their use of large embedding tables, and are trained by distributing the model across the memory of tens or hundreds of servers. Server failures are common in such large distributed systems and must be mitigated to enable training to progress. Checkpointing is the primary approach used for fault tolerance in these systems, but incurs significant training-time overhead both during normal operation and when recovering from failures. As these overheads increase with DLRM size, checkpointing is slated to become an even larger overhead for future DLRMs, which are expected to grow in size. This calls for rethinking fault tolerance in DLRM training. We present ECRM, a DLRM training system that achieves efficient fault tolerance using erasure coding.…
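The abstract above is truncated and does not describe the code ECRM uses, so the following is only a minimal sketch of the general idea of erasure-coded fault tolerance for embedding tables. It assumes a single parity shard formed as the element-wise sum of k equally shaped shards; the names `encode_parity`, `apply_update`, and `recover` are hypothetical and are not taken from the paper or from any library.

```python
from typing import List

Table = List[List[float]]  # one embedding shard: rows x embedding dimension


def encode_parity(shards: List[Table]) -> Table:
    """Build one parity shard as the element-wise sum of all data shards.

    Assumption: every shard has the same shape (illustrative only).
    """
    rows, dim = len(shards[0]), len(shards[0][0])
    return [[sum(s[r][c] for s in shards) for c in range(dim)]
            for r in range(rows)]


def apply_update(shards: List[Table], parity: Table,
                 shard_id: int, row: int, delta: List[float]) -> None:
    """Apply a sparse row update and add the same delta to the parity row,
    so the parity stays equal to the sum of the data shards."""
    for c, d in enumerate(delta):
        shards[shard_id][row][c] += d
        parity[row][c] += d


def recover(shards: List[Table], parity: Table, lost_id: int) -> Table:
    """Rebuild the lost shard as parity minus the surviving shards."""
    survivors = [s for i, s in enumerate(shards) if i != lost_id]
    return [[parity[r][c] - sum(s[r][c] for s in survivors)
             for c in range(len(parity[0]))]
            for r in range(len(parity))]


# Three toy shards, each with two rows of dimension two.
shards = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, 0.25], [1.5, 2.5]],
    [[8.0, 4.0], [2.0, 1.0]],
]
parity = encode_parity(shards)

# A training step touches row 0 of shard 1; parity is updated in step.
apply_update(shards, parity, shard_id=1, row=0, delta=[0.25, -0.5])

# Pretend shard 1 is lost and rebuild it from parity and the survivors.
rebuilt = recover(shards, parity, lost_id=1)
```

In this sketch, recovery needs only the parity and the surviving shards, with no rollback to an earlier checkpoint. A real system would also have to handle floating-point rounding, concurrent updates, and the placement of parity across servers, none of which this toy example attempts.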

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Machine Learning in Healthcare · Advanced Graph Neural Networks