Accumulated Decoupled Learning: Mitigating Gradient Staleness in Inter-Layer Model Parallelization

Huiping Zhuang; Zhiping Lin; Kar-Ann Toh

arXiv:2012.03747·cs.LG·December 8, 2020

Open Access

TL;DR

This paper introduces Accumulated Decoupled Learning (ADL), a method that reduces gradient staleness in asynchronous inter-layer model parallelization, leading to faster convergence and improved classification accuracy.

Contribution

The paper proposes ADL, which incorporates gradient accumulation to mitigate gradient staleness, and supports it with a proof of convergence to critical points and empirical validation on CIFAR-10 and ImageNet.

Findings

01. ADL reduces gradient staleness effectively.

02. ADL converges to critical points despite asynchrony.

03. ADL outperforms state-of-the-art methods in speed and accuracy.

Abstract

Decoupled learning is a branch of model parallelism that parallelizes the training of a network by splitting it depth-wise into multiple modules. Techniques from decoupled learning usually suffer from a stale gradient effect because of their asynchronous implementation, which degrades performance. In this paper, we propose accumulated decoupled learning (ADL), which incorporates the gradient accumulation technique to mitigate the stale gradient effect. We give both theoretical and empirical evidence of how the gradient staleness can be reduced. We prove that the proposed method can converge to critical points, i.e., the gradients converge to 0, in spite of its asynchronous nature. Empirical validation is provided by training deep convolutional neural networks to perform classification tasks on the CIFAR-10 and ImageNet datasets. ADL is shown to outperform several…
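To make the intuition concrete, here is a minimal sketch of a toy counting model, not the authors' algorithm. It assumes that a module's gradient arrives a fixed number of mini-batches late (`delay`) and that weights are updated only once every `accum_steps` mini-batches, and it measures staleness as the number of weight updates that happened between computing a gradient and using it. The function name and both parameters are hypothetical and chosen for illustration only.

```python
def mean_staleness(num_batches: int, delay: int, accum_steps: int) -> float:
    """Average staleness, counted in weight updates, under a toy model.

    Assumptions (illustrative, not taken from the paper):
      * the gradient used at iteration t was computed with the weights
        that were current at iteration t - delay;
      * weights are updated once after every `accum_steps` iterations.
    """
    versions = []  # weight version visible at each iteration
    version = 0    # number of weight updates applied so far
    total = 0
    counted = 0
    for t in range(num_batches):
        versions.append(version)
        if t >= delay:
            # Updates applied since the gradient's weights were read.
            total += version - versions[t - delay]
            counted += 1
        if (t + 1) % accum_steps == 0:
            version += 1  # end of an accumulation window: apply one update
    return total / counted


if __name__ == "__main__":
    print(mean_staleness(403, delay=3, accum_steps=1))  # no accumulation
    print(mean_staleness(403, delay=3, accum_steps=4))  # accumulate 4 batches
```

In this model, updating on every mini-batch gives a staleness equal to `delay`, while accumulating over several mini-batches lowers the average toward `delay / accum_steps`, because fewer weight updates occur during the same delay.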

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Machine Learning and ELM