A Dynamic Weighting Strategy to Mitigate Worker Node Failure in Distributed Deep Learning

Yuesheng Xu; Arielle Carr

arXiv:2409.09242 · cs.LG · September 17, 2024

TL;DR

This paper introduces a dynamic weighting strategy to address worker node failures in distributed deep learning, improving training efficiency and convergence by mitigating straggler effects.

Contribution

The paper proposes a dynamic weighting approach intended to make distributed deep learning training more robust and efficient when worker nodes fail.

Findings

01. Improved convergence rates with the proposed strategy
02. Enhanced training efficiency in distributed systems
03. Better test performance under node failure conditions

Abstract

The increasing complexity of deep learning models and the demand for processing vast amounts of data make the utilization of large-scale distributed systems for efficient training essential. These systems, however, face significant challenges such as communication overhead, hardware limitations, and node failure. This paper investigates various optimization techniques in distributed deep learning, including Elastic Averaging SGD (EASGD) and the second-order method AdaHessian. We propose a dynamic weighting strategy to mitigate the problem of straggler nodes due to failure, enhancing the performance and efficiency of the overall training process. We conduct experiments with different numbers of workers and communication periods to demonstrate improved convergence rates and test performance using our strategy.
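To make the setting concrete, here is a minimal sketch of one synchronous EASGD round with an optional per-worker weight on the elastic term. With all weights equal to 1 it reduces to the standard EASGD update, in which each worker is pulled toward a shared center variable and the center is pulled toward the workers. The abstract does not state the paper's weighting rule, so the `weights` argument and the `staleness_weights` helper are hypothetical placeholders for illustration only, not the authors' method.

```python
def easgd_round(workers, center, grads, lr, alpha, weights=None):
    """One synchronous EASGD round on flat parameter lists.

    workers: list of per-worker parameter lists
    center:  parameter list of the center variable
    grads:   per-worker gradient lists, evaluated at workers[i]
    weights: optional per-worker weights (hypothetical, not from the paper);
             None means 1.0 for every worker, i.e. standard EASGD.
    """
    if weights is None:
        weights = [1.0] * len(workers)
    new_workers = []
    new_center = list(center)
    for x, g, w in zip(workers, grads, weights):
        # Elastic difference between this worker and the current center.
        diff = [xi - ci for xi, ci in zip(x, center)]
        # Worker step: local gradient step plus a (weighted) pull toward the center.
        new_workers.append(
            [xi - lr * gi - alpha * w * di for xi, gi, di in zip(x, g, diff)]
        )
        # Center step: (weighted) pull toward this worker.
        for k, di in enumerate(diff):
            new_center[k] += alpha * w * di
    return new_workers, new_center


def staleness_weights(staleness, decay=0.5):
    """Hypothetical rule: weight shrinks geometrically with rounds since a
    worker last reported, so a failed or lagging worker moves the center less."""
    return [decay ** s for s in staleness]
```

For example, calling `easgd_round(workers, center, grads, lr, alpha, weights=staleness_weights([0, 3]))` would let the second worker, which has missed three rounds, influence the center at one eighth of the normal strength.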

Taxonomy

Topics: AI and HR Technologies

Methods: AdaHessian · Stochastic Gradient Descent