A Survey on Fault-tolerance in Distributed Optimization and Machine Learning

Shuo Liu

arXiv:2106.08545 · cs.DC · June 29, 2021 · 6 cites

Open Access

TL;DR

This survey reviews the current state of fault-tolerance in distributed optimization and machine learning, highlighting the importance of resilient algorithms to handle failures, communication issues, and attacks in large-scale systems.

Contribution

It provides a comprehensive overview of existing fault-tolerance theories and algorithms in distributed optimization and machine learning.

Findings

01. Identification of key fault-tolerance techniques

02. Analysis of robustness in distributed algorithms

03. Highlighting open challenges and future directions

Abstract

The robustness of distributed optimization is an emerging field of study, motivated by various applications of distributed optimization including distributed machine learning, distributed sensing, and swarm robotics. With the rapid expansion of the scale of distributed systems, resilient distributed algorithms for optimization are needed, in order to mitigate system failures, communication issues, or even malicious attacks. This survey investigates the current state of fault-tolerance research in distributed optimization, and aims to provide an overview of the existing studies on both fault-tolerant distributed optimization theories and applicable algorithms.

Taxonomy

Topics: IoT and Edge/Fog Computing · Cloud Computing and Resource Management · Smart Grid Security and Resilience