Weight for Robustness: A Comprehensive Approach towards Optimal Fault-Tolerant Asynchronous ML

Tehila Dahan; Kfir Y. Levy

arXiv:2501.09621 · cs.LG · June 5, 2025

TL;DR

This paper proposes a weighted robust aggregation framework combined with variance reduction techniques to improve fault tolerance and convergence in asynchronous Byzantine-robust distributed machine learning systems.

Contribution

The paper introduces a novel weighted aggregation method tailored for asynchronous settings, extending existing robust aggregators and achieving optimal convergence rates.

Findings

1. Enhanced fault tolerance in asynchronous ML systems.
2. Proven optimal convergence rate with variance reduction.
3. Validated effectiveness through empirical and theoretical analysis.

Abstract

We address the challenges of Byzantine-robust training in asynchronous distributed machine learning systems, aiming to enhance efficiency amid massive parallelization and heterogeneous computing resources. Asynchronous systems, marked by independently operating workers and intermittent updates, uniquely struggle with maintaining integrity against Byzantine failures, which encompass malicious or erroneous actions that disrupt learning. The inherent delays in such settings not only introduce additional bias to the system but also obscure the disruptions caused by Byzantine faults. To tackle these issues, we adapt the Byzantine framework to asynchronous dynamics by introducing a novel weighted robust aggregation framework. This allows for the extension of robust aggregators and a recent meta-aggregator to their weighted versions, mitigating the effects of delayed updates. By further…
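To make the idea of a weighted robust aggregator concrete, here is a minimal sketch of a weighted coordinate-wise median. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify which aggregation rules are used or how the weights are set, so the choice of the median, the function names `weighted_median` and `weighted_coordinate_median`, and the use of per-worker update counts as weights are all hypothetical.

```python
from typing import List, Sequence


def weighted_median(values: Sequence[float], weights: Sequence[float]) -> float:
    """Return the smallest value whose cumulative weight reaches half the total."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and equal in length")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    half = 0.5 * sum(weights)
    cumulative = 0.0
    # Walk the values in increasing order, accumulating their weights.
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value
    raise RuntimeError("unreachable: cumulative weight never reached half the total")


def weighted_coordinate_median(
    updates: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Aggregate one vector per worker by taking a weighted median per coordinate."""
    dim = len(updates[0])
    if any(len(u) != dim for u in updates):
        raise ValueError("all updates must have the same dimension")
    return [weighted_median([u[j] for u in updates], weights) for j in range(dim)]


# Hypothetical example: three honest workers and one faulty worker.
# Assumption: each weight is the number of updates that worker has sent so far,
# so a slow or rarely seen worker has less influence on the aggregate.
updates = [[1.0, 10.0], [1.2, 11.0], [0.9, 9.5], [100.0, -100.0]]
weights = [3, 3, 3, 1]
aggregate = weighted_coordinate_median(updates, weights)
weighted_mean_first = sum(w * u[0] for u, w in zip(updates, weights)) / sum(weights)
```

In this toy example the outlier holds 1 of 10 units of weight, so the weighted median stays at `[1.0, 10.0]`, while a plain weighted average of the first coordinate is dragged to roughly 10.93.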

Code & Models

Repositories

dahan198/asynchronous-fault-tolerant-ml (PyTorch, official)

Videos

Weight for Robustness: A Comprehensive Approach towards Optimal Fault-Tolerant Asynchronous ML (SlidesLive)

Taxonomy

Topics: Embedded Systems Design Techniques · Interconnection Networks and Systems · Radiation Effects in Electronics