Zeno: Distributed Stochastic Gradient Descent with Suspicion-based Fault-tolerance
Cong Xie, Oluwasanmi Koyejo, Indranil Gupta

TL;DR
Zeno is a fault-tolerant distributed SGD method that tolerates an arbitrary number of faulty workers, provided at least one worker is non-faulty. It suspects potentially defective workers and ranks them, and it is proved to converge on non-convex problems.
Contribution
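Zeno extends fault tolerance in distributed SGD beyond prior work's assumption of a non-faulty majority: it needs only one non-faulty worker. It suspects potentially defective workers and uses a ranking-based preference mechanism to limit the effect of false positives.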
Findings
Zeno outperforms existing fault-tolerance methods in experiments.
Proves convergence of SGD with suspicion-based fault detection in non-convex settings.
Handles any number of faulty workers as long as one worker is non-faulty; prior methods required a non-faulty majority.
Abstract
We present Zeno, a technique to make distributed machine learning, particularly Stochastic Gradient Descent (SGD), tolerant to an arbitrary number of faulty workers. Zeno generalizes previous results that assumed a majority of non-faulty nodes; we need assume only one non-faulty worker. Our key idea is to suspect workers that are potentially defective. Since this is likely to lead to false positives, we use a ranking-based preference mechanism. We prove the convergence of SGD for non-convex problems under these scenarios. Experimental results show that Zeno outperforms existing approaches.
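The suspect-then-rank idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring rule, so everything below is an assumption made for illustration: the score (estimated loss decrease minus a penalty on update magnitude), the function names `suspicion_score` and `rank_and_average`, and the parameters `lr`, `rho`, and `num_trimmed` are hypothetical, not the authors' definitions.

```python
import numpy as np


def suspicion_score(loss_fn, x, update, lr, rho):
    """Hypothetical score: estimated loss decrease if `update` were applied
    as a gradient step, minus a penalty on the update's squared norm.
    Higher means less suspicious."""
    decrease = loss_fn(x) - loss_fn(x - lr * update)
    return decrease - rho * float(np.dot(update, update))


def rank_and_average(loss_fn, x, updates, num_trimmed, lr=0.1, rho=0.01):
    """Rank candidate updates by score, drop the `num_trimmed` lowest-scoring
    ones, and average the rest. Returns the aggregate and the kept indices."""
    scores = np.array([suspicion_score(loss_fn, x, u, lr, rho) for u in updates])
    num_kept = len(updates) - num_trimmed
    keep = np.argsort(scores)[::-1][:num_kept]  # indices of highest scores
    aggregate = np.mean([updates[i] for i in keep], axis=0)
    return aggregate, keep


# Toy problem (assumed for the demo): f(x) = 0.5 * ||x||^2, whose gradient is x.
def loss_fn(x):
    return 0.5 * float(np.dot(x, x))


x = np.array([1.0, -2.0])
honest = x.copy()            # the true gradient at x
faulty = -10.0 * x           # sign-flipped and scaled update
updates = [faulty, faulty, faulty, honest]  # three faulty workers, one honest

aggregate, keep = rank_and_average(loss_fn, x, updates, num_trimmed=3)
print("kept workers:", list(keep))
print("aggregate:", aggregate)
```

In this toy run three of the four workers are faulty, so a majority-based rule would have no honest majority to rely on. Ranking by score still selects the single honest update, because the faulty updates increase the loss and carry a large norm penalty.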
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Advanced Neural Network Applications
Methods: Stochastic Gradient Descent
