Training Through Failure: Effects of Data Consistency in Parallel Machine Learning Training
Ray Cao, Sherry Luo, Steve Gan, Sujeeth Jinesh

TL;DR
This paper investigates how relaxing data consistency with a stateless parameter server during failures can improve training resilience and accuracy, offering a novel approach that maintains progress despite server downtime.
Contribution
The study introduces a novel stateless parameter server method that allows continued training during failures, outperforming traditional checkpointing and chain replication in resilience and accuracy.
Findings
The stateless parameter server maintains training progress during failures.
The approach improves accuracy by up to 10% despite stale updates.
Resource costs are comparable to standard checkpointing methods.
Abstract
In this study, we explore the impact of relaxing data consistency in parallel machine learning training during a failure using various parameter server configurations. Our failure recovery strategies include traditional checkpointing, chain replication (which ensures a backup server takes over in case of failure), and a novel stateless parameter server approach. In the stateless approach, workers continue generating gradient updates even if the parameter server is down, applying these updates once the server is back online. We compare these techniques to a standard checkpointing approach, where the training job is resumed from the latest checkpoint. To assess the resilience and performance of each configuration, we intentionally killed the parameter server during training for each experiment. Our experiment results indicate that the stateless parameter server approach continues to…
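The stateless behavior described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation. The `ParameterServer`, `Worker`, and `run` names, the toy linear-regression task, the learning rate, and the failure window are all hypothetical choices made for illustration. While the server is down, each worker keeps computing gradients against its last cached (stale) weights and buffers them, then flushes the buffer once the server is reachable again.

```python
import random


class ParameterServer:
    """Hypothetical in-process stand-in for a parameter server."""

    def __init__(self, dim, lr=0.05):
        self.weights = [0.0] * dim
        self.lr = lr
        self.alive = True  # flipped to False to simulate a killed server

    def pull(self):
        if not self.alive:
            raise ConnectionError("parameter server is down")
        return list(self.weights)

    def push(self, grad):
        if not self.alive:
            raise ConnectionError("parameter server is down")
        self.weights = [w - self.lr * g for w, g in zip(self.weights, grad)]


class Worker:
    """Keeps training on stale weights and buffers gradients during an outage."""

    def __init__(self, server, data):
        self.server = server
        self.data = data
        self.cached = server.pull()  # last weights seen from the server
        self.pending = []            # gradients not yet applied by the server

    @staticmethod
    def gradient(weights, x, y):
        # Gradient of the squared error (w . x - y)^2 with respect to w.
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        return [2.0 * err * xi for xi in x]

    def step(self, rng):
        x, y = rng.choice(self.data)
        try:
            self.cached = self.server.pull()
        except ConnectionError:
            pass  # server down: keep using the stale cached weights
        self.pending.append(self.gradient(self.cached, x, y))
        try:
            while self.pending:
                self.server.push(self.pending[0])
                self.pending.pop(0)  # drop only after a successful push
        except ConnectionError:
            pass  # server still down: keep the buffer for a later flush


def run(steps=300, fail_at=100, recover_at=120, seed=0):
    rng = random.Random(seed)
    true_w = [2.0, -1.0]  # assumed ground-truth weights for the toy task
    data = []
    for _ in range(200):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, sum(w * xi for w, xi in zip(true_w, x))))

    server = ParameterServer(dim=2)
    workers = [Worker(server, data) for _ in range(2)]
    max_pending = 0
    for t in range(steps):
        if t == fail_at:
            server.alive = False  # simulate killing the server
        if t == recover_at:
            server.alive = True   # server comes back online
        for w in workers:
            w.step(rng)
            max_pending = max(max_pending, len(w.pending))
    return server.weights, max_pending


if __name__ == "__main__":
    weights, max_pending = run()
    print("final weights:", weights)
    print("largest gradient backlog per worker:", max_pending)
```

In this toy setup the backlog grows by one gradient per worker for every step of downtime, and all buffered gradients were computed against the same stale weights. The outage length and learning rate are deliberately small here; a longer outage means a larger burst of stale updates at recovery, which is the consistency trade-off the paper examines.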
Taxonomy
Topics: Machine Learning and Data Classification
