TL;DR
This paper uses Unequal Error Protection (UEP) codes to mitigate stragglers in distributed matrix multiplication. Blocks with a larger effect on the final product receive stronger protection, which reduces training time for deep neural networks.
Contribution
The paper proposes a new UEP coding strategy for approximate matrix multiplication in distributed systems, with theoretical error bounds and practical evaluation on neural network training.
Findings
Significant reduction in training time with UEP codes
Theoretical bounds on the expected reconstruction error for matrices with uncorrelated entries
Effective application to deep neural network gradient computation
Abstract
Large-scale machine learning and data mining methods routinely distribute computations across multiple agents to parallelize processing. The time required for the computations at the agents is affected by the availability of local resources and/or poor channel conditions giving rise to the "straggler problem". As a remedy to this problem, we employ Unequal Error Protection (UEP) codes to obtain an approximation of the matrix product in the distributed computation setting to provide higher protection for the blocks with higher effect on the final result. We characterize the performance of the proposed approach from a theoretical perspective by bounding the expected reconstruction error for matrices with uncorrelated entries. We also apply the proposed coding strategy to the computation of the back-propagation step in the training of a Deep Neural Network (DNN) for an image classification…
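The idea of protecting important blocks more strongly can be illustrated with a minimal sketch. It is not the paper's scheme: plain replication stands in for the UEP codes, the product of Frobenius norms is an assumed importance measure, and the function names (`block_weights`, `allocate_replicas`, `approx_product`) are hypothetical. Sub-products whose workers all straggle are replaced by zeros, which yields an approximation of the full product.

```python
import numpy as np


def split_blocks(A, B, n_row, n_col):
    """Split A into row blocks and B into column blocks."""
    return np.array_split(A, n_row, axis=0), np.array_split(B, n_col, axis=1)


def block_weights(A, B, n_row, n_col):
    """Assumed importance of sub-product A_i @ B_j: ||A_i||_F * ||B_j||_F."""
    A_blocks, B_blocks = split_blocks(A, B, n_row, n_col)
    return np.array([np.linalg.norm(Ai) * np.linalg.norm(Bj)
                     for Ai in A_blocks for Bj in B_blocks])


def allocate_replicas(weights, n_workers):
    """Give each sub-product one worker, then hand out the remaining
    workers one at a time to the task with the largest weight per replica.
    Assumes n_workers >= len(weights)."""
    replicas = np.ones(len(weights), dtype=int)
    for _ in range(n_workers - len(weights)):
        replicas[np.argmax(weights / replicas)] += 1
    return replicas


def approx_product(A, B, n_row, n_col, replicas, finished):
    """Rebuild A @ B from the sub-products that have at least one finished
    worker; sub-products with no finished worker are set to zero.
    `finished` is a boolean array with one entry per worker."""
    A_blocks, B_blocks = split_blocks(A, B, n_row, n_col)
    # Workers are assigned to tasks in order: task 0 first, then task 1, ...
    task_of_worker = np.repeat(np.arange(n_row * n_col), replicas)
    recovered = np.zeros(n_row * n_col, dtype=bool)
    recovered[task_of_worker[finished]] = True
    rows = []
    for i, Ai in enumerate(A_blocks):
        row = []
        for j, Bj in enumerate(B_blocks):
            Cij = Ai @ Bj
            if not recovered[i * n_col + j]:
                Cij = np.zeros_like(Cij)
            row.append(Cij)
        rows.append(row)
    return np.block(rows)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 6))
    A[:2] *= 100.0  # make the first row block dominate the product
    B = rng.standard_normal((6, 4))
    replicas = allocate_replicas(block_weights(A, B, 2, 2), n_workers=6)
    finished = rng.random(replicas.sum()) < 0.5  # each worker finishes w.p. 0.5
    C_hat = approx_product(A, B, 2, 2, replicas, finished)
    rel_err = np.linalg.norm(A @ B - C_hat) / np.linalg.norm(A @ B)
    print("replicas per sub-product:", replicas)
    print("relative reconstruction error:", rel_err)
```

In this toy setting the high-norm sub-products receive the extra workers, so they are less likely to be lost when some workers miss the deadline. A coded scheme such as the one the abstract describes would replace the replication step with encoded combinations of blocks.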
