On the Fundamental Limits of Coded Data Shuffling for Distributed Machine Learning
Adel Elmahdy, Soheil Mohajer

TL;DR
This paper establishes the fundamental limits of data shuffling in distributed machine learning, providing exact trade-offs between communication load and storage, and introduces an optimal coded shuffling scheme that leverages cache memories.
Contribution
It characterizes the exact load-memory trade-off for worst-case data shuffling and proposes an optimal coded shuffling scheme with proven optimality.
Findings
Exact load-memory trade-off for worst-case shuffling derived.
Proposed coded shuffling scheme outperforms existing methods.
Optimality of the scheme proven through matching lower bounds.
Abstract
We consider the data shuffling problem in a distributed learning system, in which a master node is connected to a set of worker nodes, via a shared link, in order to communicate a set of files to the worker nodes. The master node has access to a database of files. In every shuffling iteration, each worker node processes a new subset of files, and has excess storage to partially cache the remaining files, assuming the cached files are uncoded. The caches of the worker nodes are updated every iteration, and they should be designed to satisfy any possible unknown permutation of the files in subsequent iterations. For this problem, we characterize the exact load-memory trade-off for worst-case shuffling by deriving the minimum communication load for a given storage capacity per worker node. As a byproduct, the exact load-memory trade-off for any shuffling is characterized when the number of…
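To make the setting in the abstract concrete, here is a minimal toy sketch of why a coded broadcast over the shared link can lower the communication load. It is not the scheme proposed in the paper. It assumes just two workers and two equal-size files, with a shuffle that swaps the files between the workers. The names `xor_bytes` and `simulate_swap` are hypothetical and introduced only for illustration.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("files must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def simulate_swap(file_a: bytes, file_b: bytes) -> dict:
    """Toy shuffle (illustrative assumption, not the paper's scheme).

    Iteration t:   worker 1 processes and stores A, worker 2 stores B.
    Iteration t+1: the permutation swaps them, so worker 1 needs B
                   and worker 2 needs A.
    """
    file_size = len(file_a)

    # Uncoded delivery: the master sends each missing file separately.
    uncoded_transmissions = [file_b, file_a]

    # Coded delivery: the master broadcasts a single XOR of both files.
    coded_transmissions = [xor_bytes(file_a, file_b)]

    # Each worker cancels the file it already stores to recover the other.
    worker1_recovered = xor_bytes(coded_transmissions[0], file_a)
    worker2_recovered = xor_bytes(coded_transmissions[0], file_b)

    # Communication load, measured in units of one file.
    uncoded_load = sum(len(t) for t in uncoded_transmissions) / file_size
    coded_load = sum(len(t) for t in coded_transmissions) / file_size

    return {
        "worker1_recovered": worker1_recovered,
        "worker2_recovered": worker2_recovered,
        "uncoded_load": uncoded_load,
        "coded_load": coded_load,
    }


if __name__ == "__main__":
    result = simulate_swap(b"AAAAAAAA", b"BBBBBBBB")
    print(result)
```

In this toy case, one coded transmission serves both workers, whereas uncoded delivery needs two. The paper addresses the general problem: the minimum load for a given storage capacity per worker node under worst-case shuffling.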
