Weight for Robustness: A Comprehensive Approach towards Optimal Fault-Tolerant Asynchronous ML
Tehila Dahan, Kfir Y. Levy

TL;DR
This paper proposes a weighted robust aggregation framework combined with variance reduction techniques to improve fault tolerance and convergence in asynchronous Byzantine-robust distributed machine learning systems.
Contribution
It introduces a novel weighted aggregation method tailored for asynchronous settings, extending existing robust aggregators and achieving optimal convergence rates.
Findings
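Weighted robust aggregation improves fault tolerance in asynchronous ML systems by mitigating the effects of delayed updates.
Combined with variance reduction, the method has a proven optimal convergence rate.
Effectiveness is supported by both theoretical analysis and empirical results.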
Abstract
We address the challenges of Byzantine-robust training in asynchronous distributed machine learning systems, aiming to enhance efficiency amid massive parallelization and heterogeneous computing resources. Asynchronous systems, marked by independently operating workers and intermittent updates, uniquely struggle with maintaining integrity against Byzantine failures, which encompass malicious or erroneous actions that disrupt learning. The inherent delays in such settings not only introduce additional bias to the system but also obscure the disruptions caused by Byzantine faults. To tackle these issues, we adapt the Byzantine framework to asynchronous dynamics by introducing a novel weighted robust aggregation framework. This allows for the extension of robust aggregators and a recent meta-aggregator to their weighted versions, mitigating the effects of delayed updates. By further…
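To make the idea of a weighted robust aggregator concrete, here is a minimal sketch of one plausible instance: a weighted coordinate-wise median. It is an illustration only and is not taken from the paper. The function names, the choice of the median as the underlying aggregator, and the rule for setting each worker's weight (for example, by how many updates that worker has contributed) are all assumptions.

```python
from typing import List, Sequence


def weighted_median(values: Sequence[float], weights: Sequence[float]) -> float:
    """Return the smallest value whose cumulative weight reaches half the total.

    Hypothetical helper for illustration; weights must be non-negative
    and must not all be zero.
    """
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and equally long")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")

    pairs = sorted(zip(values, weights))  # sort by value
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= total / 2:
            return value
    return pairs[-1][0]  # defensive fallback against rounding error


def weighted_coordinatewise_median(
    vectors: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Aggregate worker vectors by taking a weighted median per coordinate."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [
        weighted_median([v[i] for v in vectors], weights) for i in range(dim)
    ]


# Three well-behaved workers and one outlier carrying little weight.
# (Assumed weighting: number of updates each worker has contributed.)
worker_vectors = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1], [100.0, -100.0]]
worker_weights = [3, 2, 3, 1]
print(weighted_coordinatewise_median(worker_vectors, worker_weights))  # [1.0, 2.0]
```

In this toy example the outlier holds only 1 of 9 units of weight, so it cannot move the weighted median in either coordinate. A plain average of the same four vectors would be pulled far from the honest workers' values.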
Taxonomy
Topics: Embedded Systems Design Techniques · Interconnection Networks and Systems · Radiation Effects in Electronics
