A Survey on Fault-tolerance in Distributed Optimization and Machine Learning
Shuo Liu

TL;DR
This survey reviews the current state of fault-tolerance in distributed optimization and machine learning, highlighting the importance of resilient algorithms to handle failures, communication issues, and attacks in large-scale systems.
Contribution
It provides a comprehensive overview of existing fault-tolerance theories and algorithms in distributed optimization and machine learning.
Findings
Identification of key fault-tolerance techniques
Analysis of robustness in distributed algorithms
Highlighting open challenges and future directions
Abstract
The robustness of distributed optimization is an emerging field of study, motivated by applications such as distributed machine learning, distributed sensing, and swarm robotics. As distributed systems rapidly grow in scale, resilient distributed optimization algorithms are needed to mitigate system failures, communication issues, or even malicious attacks. This survey investigates the current state of fault-tolerance research in distributed optimization and aims to provide an overview of existing studies on both fault-tolerant distributed optimization theories and applicable algorithms.
Taxonomy
Topics: IoT and Edge/Fog Computing · Cloud Computing and Resource Management · Smart Grid Security and Resilience
