AB-Training: A Communication-Efficient Approach for Distributed Low-Rank Learning

Daniel Coquelin; Katherina Flügel; Marie Weiel; Nicholas Kiefer; Muhammed Öz; Charlotte Debus; Achim Streit; Markus Götz

arXiv:2405.01067·cs.LG·July 2, 2024

TL;DR

AB-training is a communication-efficient distributed training method that uses low-rank representations and independent training groups to significantly reduce network traffic and improve scalability, with improved generalization at smaller scales.

Contribution

This paper introduces AB-training, a novel low-rank, data-parallel approach that reduces communication overhead and improves training efficiency in distributed neural network training.

Findings

01 Reduced network traffic by an average of approximately 70.31% across various scaling scenarios

02 Achieved a 44.14:1 compression ratio on VGG16 trained on CIFAR-10 with minimal accuracy loss

03 Outperformed traditional data-parallel training by 1.55% on ResNet-50 with ImageNet

Abstract

Communication bottlenecks severely hinder the scalability of distributed neural network training, particularly in high-performance computing (HPC) environments. We introduce AB-training, a novel data-parallel method that leverages low-rank representations and independent training groups to significantly reduce communication overhead. Our experiments demonstrate an average reduction in network traffic of approximately 70.31% across various scaling scenarios, increasing the training potential of communication-constrained systems and accelerating convergence at scale. AB-training also exhibits a pronounced regularization effect at smaller scales, leading to improved generalization while maintaining or even reducing training time. We achieve a remarkable 44.14:1 compression ratio on VGG16 trained on CIFAR-10 with minimal accuracy loss, and outperform traditional data-parallel training by…
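To make the notion of a low-rank representation and its compression ratio concrete, the following is a minimal sketch in plain Python. It assumes a weight matrix W of shape m x n is approximated by the product of two factors, A (m x r) and B (r x n), so that r * (m + n) values are stored instead of m * n. The helper names, the layer size, and the rank are hypothetical choices for illustration. They are not taken from the paper, and the sketch does not reproduce the paper's training procedure or its reported figures.

```python
def dense_params(m: int, n: int) -> int:
    """Number of entries in a full m x n weight matrix W."""
    return m * n


def low_rank_params(m: int, n: int, r: int) -> int:
    """Entries in the factors A (m x r) and B (r x n) of W ~= A @ B."""
    return r * (m + n)


def compression_ratio(m: int, n: int, r: int) -> float:
    """Dense parameter count divided by the factored parameter count."""
    return dense_params(m, n) / low_rank_params(m, n, r)


# Hypothetical example (assumed values, not from the paper):
# a 4096 x 4096 fully connected layer stored at rank 64.
ratio = compression_ratio(4096, 4096, 64)
print(f"{ratio:.1f}:1")  # prints "32.0:1"
```

Under these assumptions, the ratio grows as the retained rank r shrinks relative to the matrix dimensions, which is why low-rank factors are attractive when the values exchanged between workers dominate the cost.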

Taxonomy

Topics: Speech Recognition and Synthesis · Security in Wireless Sensor Networks · Machine Learning and Algorithms