Fault Tolerant ML: Efficient Meta-Aggregation and Synchronous Training
Tehila Dahan, Kfir Y. Levy

TL;DR
This paper introduces a novel, efficient meta-aggregation method and a gradient estimation technique to improve Byzantine-robust distributed machine learning, enhancing resilience and reducing tuning complexity.
Contribution
It presents the Centered Trimmed Meta Aggregator (CTMA) for optimal, low-cost aggregation and a double-momentum gradient estimator, advancing Byzantine robustness in distributed ML.
Findings
CTMA upgrades baseline aggregators to optimal performance levels at low computational cost.
The double-momentum gradient estimator simplifies tuning and reduces the number of hyperparameters.
Empirical results confirm theoretical advantages in Byzantine-robust training.
Abstract
In this paper, we investigate the challenging framework of Byzantine-robust training in distributed machine learning (ML) systems, focusing on enhancing both efficiency and practicality. As distributed ML systems become integral for complex ML tasks, ensuring resilience against Byzantine failures, where workers may contribute incorrect updates due to malice or error, gains paramount importance. Our first contribution is the introduction of the Centered Trimmed Meta Aggregator (CTMA), an efficient meta-aggregator that upgrades baseline aggregators to optimal performance levels while incurring low computational cost. Additionally, we propose harnessing a recently developed gradient estimation technique based on a double-momentum strategy within the Byzantine context. Our paper highlights its theoretical and practical advantages for Byzantine-robust training, especially in simplifying…
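The abstract does not spell out CTMA's steps, so the following is only a minimal sketch of what a "centered trimmed" meta-aggregator could look like, not the authors' algorithm. It assumes the meta-aggregator centers on the output of a baseline robust aggregator, keeps the worker vectors closest to that center, and averages them. The function names, the `num_trim` parameter, and the choice of coordinate-wise median as the baseline are all hypothetical.

```python
import math
from statistics import median


def coordinate_median(vectors):
    """Baseline robust aggregator (assumed for illustration): coordinate-wise median."""
    return [median(coords) for coords in zip(*vectors)]


def centered_trimmed_mean(vectors, num_trim, base_aggregator=coordinate_median):
    """Hypothetical sketch of a centered trimmed meta-aggregator.

    1. Compute a center with the baseline aggregator.
    2. Rank the input vectors by Euclidean distance to that center.
    3. Drop the `num_trim` farthest vectors and average the rest.
    """
    if not 0 <= num_trim < len(vectors):
        raise ValueError("num_trim must be in [0, len(vectors))")
    center = base_aggregator(vectors)
    ranked = sorted(vectors, key=lambda v: math.dist(v, center))
    kept = ranked[: len(vectors) - num_trim]
    return [sum(coords) / len(kept) for coords in zip(*kept)]


# Toy data: four honest workers near (1, 1) and one Byzantine outlier.
honest = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2], [1.0, 1.0]]
byzantine = [[100.0, -100.0]]
result = centered_trimmed_mean(honest + byzantine, num_trim=1)
plain_mean = [sum(c) / 5 for c in zip(*(honest + byzantine))]
```

In this toy run the outlier is the farthest vector from the median center, so it is trimmed and `result` stays close to the honest workers, while `plain_mean` is pulled far away. The only extra work on top of the baseline aggregator is one distance computation per worker and a sort.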
Taxonomy
Topics: Distributed systems and fault tolerance · IoT and Edge/Fog Computing · Cloud Computing and Resource Management
