ATA: Adaptive Task Allocation for Efficient Resource Management in Distributed Machine Learning
Artavazd Maranjyan, El Mehdi Saad, Peter Richtárik, Francesco Orabona

TL;DR
This paper introduces ATA, an adaptive task allocation method for distributed machine learning that reduces resource use and cost by adapting to heterogeneous worker computation times without prior knowledge of those times.
Contribution
ATA is a novel adaptive task allocation algorithm that achieves near-optimal resource efficiency in distributed training without prior knowledge of computation times.
Findings
Theoretical analysis shows that ATA identifies the optimal task allocation.
ATA reduces resource costs significantly in experiments.
ATA performs comparably to methods with prior knowledge of computation times.
Abstract
Asynchronous methods are fundamental for parallelizing computations in distributed machine learning. They aim to accelerate training by fully utilizing all available resources. However, their greedy approach can lead to inefficiencies, using more computation than required, especially when computation times vary across devices. If the computation times were known in advance, training could be both fast and resource-efficient by assigning more tasks to faster workers. The challenge lies in achieving this optimal allocation without prior knowledge of the computation time distributions. In this paper, we propose ATA (Adaptive Task Allocation), a method that adapts to heterogeneous and random distributions of worker computation times. Through rigorous theoretical analysis, we show that ATA identifies the optimal task allocation and performs comparably to methods with prior knowledge of computation…
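The known-times baseline mentioned in the abstract (assigning more tasks to faster workers) can be illustrated with a small sketch. This is not ATA itself, which works without knowing the times. It assumes, purely for illustration, that every worker has a fixed, known time per task and that a round needs a fixed number of tasks. The names `allocate_tasks`, `round_time`, and `busy_time` are hypothetical and do not come from the paper.

```python
import heapq


def allocate_tasks(mean_times, num_tasks):
    """Assign num_tasks identical tasks to workers with known per-task times.

    Each task goes to the worker that would finish it earliest, which
    minimizes the time until all tasks are done when times are
    deterministic (an assumption made for this sketch).
    """
    allocation = [0] * len(mean_times)
    # Heap entries: (finish time if this worker takes one more task, worker index).
    heap = [(t, i) for i, t in enumerate(mean_times)]
    heapq.heapify(heap)
    for _ in range(num_tasks):
        finish, i = heapq.heappop(heap)
        allocation[i] += 1
        heapq.heappush(heap, (finish + mean_times[i], i))
    return allocation


def round_time(mean_times, allocation):
    """Time until the slowest assigned worker finishes its tasks."""
    return max(a * t for a, t in zip(allocation, mean_times))


def busy_time(mean_times, allocation):
    """Total worker-time actually spent computing."""
    return sum(a * t for a, t in zip(allocation, mean_times))


if __name__ == "__main__":
    times = [1.0, 2.0, 10.0]  # hypothetical per-task times for three workers
    alloc = allocate_tasks(times, num_tasks=6)
    print(alloc, round_time(times, alloc), busy_time(times, alloc))
```

In this toy setting the slowest worker receives no tasks: the round takes 4 time units and uses 8 units of worker-time. Keeping all three workers busy for the same 4 units would use 12 units, with the slow worker completing nothing in that window.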
Taxonomy
Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Distributed and Parallel Computing Systems
