Straggler Mitigation through Unequal Error Protection for Distributed Approximate Matrix Multiplication

Busra Tegin; Eduin E. Hernandez; Stefano Rini; Tolga M. Duman

arXiv:2103.02928 · cs.DC · July 28, 2021

TL;DR

This paper uses Unequal Error Protection (UEP) codes to mitigate stragglers in distributed matrix multiplication. The matrix blocks that contribute most to the product receive stronger protection, which reduces the training time of deep neural networks.

Contribution

The paper proposes a new UEP coding strategy for approximate matrix multiplication in distributed systems, with theoretical error bounds and practical evaluation on neural network training.
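The idea that some blocks of a matrix product matter more than others can be illustrated with a short NumPy sketch. It assumes A is split into column blocks and B into matching row blocks, so that AB is the sum of the block products A_i B_i, and it ranks blocks by ||A_i||_F * ||B_i||_F (an upper bound on ||A_i B_i||_F). The function names and the ranking heuristic are illustrative assumptions, not the paper's code; no straggler model or coding is included here.

```python
import numpy as np

def split_blocks(A, B, num_blocks):
    """Split A into column blocks and B into matching row blocks.

    With this split, A @ B equals the sum of the block products A_i @ B_i.
    """
    col_blocks = np.array_split(A, num_blocks, axis=1)
    row_blocks = np.array_split(B, num_blocks, axis=0)
    return col_blocks, row_blocks

def importance_order(col_blocks, row_blocks):
    """Rank blocks by ||A_i||_F * ||B_i||_F, an upper bound on ||A_i @ B_i||_F."""
    scores = [np.linalg.norm(a) * np.linalg.norm(b) for a, b in zip(col_blocks, row_blocks)]
    return np.argsort(scores)[::-1]  # most important block first

def approximate_product(A, B, num_blocks, keep):
    """Sum only the `keep` highest-ranked block products (hypothetical helper)."""
    cols, rows = split_blocks(A, B, num_blocks)
    approx = np.zeros((A.shape[0], B.shape[1]))
    for i in importance_order(cols, rows)[:keep]:
        approx += cols[i] @ rows[i]
    return approx

# Usage: columns of A are scaled so that the first block dominates the product.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 12)) * np.repeat([4.0, 1.0, 0.1], 4)
B = rng.standard_normal((12, 6))
exact = A @ B
for keep in range(1, 4):
    approx = approximate_product(A, B, num_blocks=3, keep=keep)
    rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
    print(f"keep={keep}: relative error {rel_err:.3f}")
```

Keeping every block reproduces AB exactly. When block norms are very uneven, keeping only the top-ranked blocks already gives a small relative error, and this unevenness is what unequal protection exploits.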

Findings

1. Significant reduction in training time with UEP codes
2. Theoretical bounds on the expected reconstruction error for matrices with uncorrelated entries
3. Effective application to the gradient computation (back-propagation) in deep neural network training
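To see why uncorrelated entries make the reconstruction error tractable, here is a short derivation under assumptions stated for illustration only. It is not the paper's bound. Let a_k be the k-th column of A and b_k^T the k-th row of B, so that AB is the sum of the outer products a_k b_k^T. Assume A and B are independent and have zero-mean, uncorrelated entries, and that only the outer products indexed by a set S are recovered.

```latex
\hat{C}_S = \sum_{k \in S} a_k b_k^{\top}, \qquad
AB - \hat{C}_S = \sum_{k \notin S} a_k b_k^{\top}

\bigl\| AB - \hat{C}_S \bigr\|_F^2
  = \sum_{k \notin S} \sum_{l \notin S} \bigl(a_k^{\top} a_l\bigr)\bigl(b_l^{\top} b_k\bigr)

\mathbb{E}\bigl[a_k^{\top} a_l\bigr] = \sum_i \mathbb{E}\bigl[A_{ik} A_{il}\bigr] = 0 \quad (k \neq l)

\Rightarrow\quad
\mathbb{E}\,\bigl\| AB - \hat{C}_S \bigr\|_F^2
  = \sum_{k \notin S} \mathbb{E}\|a_k\|_2^2 \; \mathbb{E}\|b_k\|_2^2
```

Under these assumptions, each missing outer product adds its own term to the expected error, weighted by the second moments of its column and row. Indices with large second moments are therefore the ones worth protecting more strongly.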

Abstract

Large-scale machine learning and data mining methods routinely distribute computations across multiple agents to parallelize processing. The time required for the computations at the agents is affected by the availability of local resources and/or poor channel conditions, giving rise to the "straggler problem". As a remedy to this problem, we employ Unequal Error Protection (UEP) codes to obtain an approximation of the matrix product in the distributed computation setting, providing higher protection for the blocks that have a greater effect on the final result. We characterize the performance of the proposed approach from a theoretical perspective by bounding the expected reconstruction error for matrices with uncorrelated entries. We also apply the proposed coding strategy to the computation of the back-propagation step in the training of a Deep Neural Network (DNN) for an image classification…
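The abstract's straggler setting can be simulated in a few lines. This is a minimal sketch under several assumptions. Each worker computes one block product, and stragglers fail independently with probability `p_straggle`. Unequal protection is modelled by simple repetition: more important blocks are assigned to more workers. This repetition scheme stands in for a UEP code and is not the paper's code construction. All function names and the `replicas` parameter are hypothetical.

```python
import numpy as np

def assign_tasks(order, replicas):
    """Map each worker to one block index.

    order: block indices sorted from most to least important.
    replicas[k]: number of workers computing the k-th most important block
    (a repetition scheme standing in for a UEP code).
    """
    tasks = []
    for block, r in zip(order, replicas):
        tasks.extend([int(block)] * r)
    return tasks

def recovered_blocks(tasks, finished):
    """Blocks with at least one replica among the finished workers."""
    return {tasks[w] for w in finished}

def reconstruct(products, recovered, shape):
    """Approximate the product from the recovered block products only."""
    approx = np.zeros(shape)
    for i in recovered:
        approx += products[i]
    return approx

def simulate(products, order, replicas, p_straggle, trials, rng):
    """Average relative Frobenius error when workers straggle independently."""
    exact = sum(products)
    tasks = assign_tasks(order, replicas)
    errs = []
    for _ in range(trials):
        finished = np.flatnonzero(rng.random(len(tasks)) >= p_straggle)
        approx = reconstruct(products, recovered_blocks(tasks, finished), exact.shape)
        errs.append(np.linalg.norm(exact - approx) / np.linalg.norm(exact))
    return float(np.mean(errs))

# Usage: 4 blocks of decreasing scale, 8 workers in both configurations.
rng = np.random.default_rng(1)
A = rng.standard_normal((16, 16)) * np.repeat([4.0, 2.0, 1.0, 0.5], 4)
B = rng.standard_normal((16, 16))
cols = np.array_split(A, 4, axis=1)
rows = np.array_split(B, 4, axis=0)
products = [a @ b for a, b in zip(cols, rows)]
order = np.argsort([np.linalg.norm(a) * np.linalg.norm(b) for a, b in zip(cols, rows)])[::-1]
uep = simulate(products, order, [3, 2, 2, 1], p_straggle=0.4, trials=2000, rng=rng)
eep = simulate(products, order, [2, 2, 2, 2], p_straggle=0.4, trials=2000, rng=rng)
print(f"unequal replication: {uep:.3f}  equal replication: {eep:.3f}")
```

Both configurations use the same number of workers. The only design choice being compared is how that redundancy is distributed across blocks of different importance.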

Code & Models

Repositories

HernandezEduin/UEP-Straggler-Mitigation (official)