Flag Aggregator: Scalable Distributed Training under Failures and Augmented Losses using Convex Optimization

Hamidreza Almasi; Harsh Mishra; Balajee Vamanan; Sathya N. Ravi

arXiv:2302.05865 · cs.LG · September 26, 2023

TL;DR

This paper introduces a convex-optimization-based aggregation method for distributed training that remains robust under Byzantine failures and data augmentation, improving accuracy and communication efficiency.

Contribution

The paper formulates gradient aggregation as a maximum likelihood estimation problem and provides a scalable, provably convergent solution that enhances robustness in distributed deep learning.

Findings

01 Significantly improves the robustness of Byzantine-resilient aggregators

02 Enhances communication efficiency in distributed training

03 Achieves better accuracy across various tasks

Abstract

Modern ML applications increasingly rely on complex deep learning models and large datasets. The amount of computation needed to train the largest models has grown exponentially. To scale computation and data, these models are therefore trained in a distributed manner across clusters of nodes, and their updates are aggregated before being applied to the model. However, a distributed setup is prone to Byzantine failures of individual nodes, components, and software. With data augmentation added to these settings, there is a critical need for robust and efficient aggregation systems. We define the quality of workers as reconstruction ratios $\in (0, 1]$, and formulate aggregation as a Maximum Likelihood Estimation procedure using Beta densities. We show that the regularized form of the log-likelihood with respect to the subspace can be approximately solved using iterative least squares…
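To make the idea of a reconstruction ratio concrete, here is a minimal sketch, not the authors' implementation. It assumes that a worker's ratio is the squared norm of its gradient projected onto a subspace with orthonormal basis `Y`, divided by the squared norm of the gradient itself, and that worker gradients are nonzero. The function names, the clipping constant `eps`, and the Beta parameters `a` and `b` are hypothetical choices made for illustration. The abstract does not specify them.

```python
import numpy as np
from math import lgamma


def reconstruction_ratios(G, Y):
    """Fraction of each gradient's squared norm captured by span(Y).

    G: (n, p) array whose columns are the p worker gradients (assumed nonzero).
    Y: (n, m) array with orthonormal columns spanning the candidate subspace.
    """
    projected = Y.T @ G                       # (m, p) subspace coordinates
    captured = np.sum(projected ** 2, axis=0)  # squared norm inside the subspace
    total = np.sum(G ** 2, axis=0)             # squared norm of each gradient
    return captured / total


def beta_log_likelihood(v, a, b, eps=1e-12):
    """Log-likelihood of ratios v under a Beta(a, b) density.

    Ratios are clipped away from 0 and 1 so the logarithms stay finite
    (an assumption of this sketch, since a ratio of exactly 1 is allowed).
    """
    v = np.clip(np.asarray(v, dtype=float), eps, 1.0 - eps)
    log_beta_fn = lgamma(a) + lgamma(b) - lgamma(a + b)
    log_density = (a - 1.0) * np.log(v) + (b - 1.0) * np.log1p(-v)
    return float(np.sum(log_density) - v.size * log_beta_fn)
```

Under these assumptions, a gradient lying entirely inside the subspace gets a ratio of 1, and a gradient far from it gets a ratio near 0, so a likelihood that favors large ratios rewards subspaces that most workers agree with.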

Code & Models

Repositories

hamidralmasi/flagaggregator (PyTorch, official)

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Sparse and Compressive Sensing Techniques