Distributed Optimization using Heterogeneous Compute Systems
Vineeth S

TL;DR
This paper addresses the challenge of training deep neural networks on heterogeneous distributed systems by dynamically adjusting data partitions based on each worker's compute power, leading to improved efficiency and reduced training time.
Contribution
It introduces a novel method for dynamically partitioning data among workers with varying compute capabilities to optimize distributed training.
Findings
Dynamic data partitioning improves system utilization.
Training time is significantly reduced.
The method is effective in heterogeneous hardware environments.
Abstract
Hardware compute power has been growing at an unprecedented rate in recent years. Exploiting these advances is key to producing better results in less time, both in academia and industry. However, combining existing hardware with the latest hardware within the same ecosystem is challenging. One of the key challenges in this setting is varying compute power. In this paper, we consider the training of deep neural networks on a distributed system of workers with varying compute power. A naive implementation of synchronous distributed training results in the faster workers waiting for the slowest worker to finish processing. To mitigate this issue, we propose to dynamically adjust the data assigned to each worker during training. We assign each worker a partition of the total data proportional to its computing power. Our experiments show that…
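The proportional assignment described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names `proportional_partition` and `sync_step_time`, the use of measured throughput (samples per second) as a stand-in for "computing power", and the largest-remainder rounding are all assumptions made for illustration.

```python
from typing import List


def proportional_partition(total_samples: int, throughputs: List[float]) -> List[int]:
    """Split total_samples across workers in proportion to their throughput.

    throughputs[i] is a hypothetical measured speed of worker i
    (samples per second); the abstract does not say how compute
    power is estimated, so this is an assumption.
    """
    if total_samples < 0:
        raise ValueError("total_samples must be non-negative")
    if not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("throughputs must be non-empty and positive")

    total_speed = sum(throughputs)
    ideal = [total_samples * t / total_speed for t in throughputs]

    # Round down first, then hand the leftover samples to the workers
    # with the largest fractional parts (largest-remainder rounding).
    sizes = [int(x) for x in ideal]
    leftover = total_samples - sum(sizes)
    order = sorted(range(len(ideal)), key=lambda i: ideal[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes


def sync_step_time(sizes: List[int], throughputs: List[float]) -> float:
    """Time for one synchronous step: every worker waits for the slowest."""
    return max(n / t for n, t in zip(sizes, throughputs))


if __name__ == "__main__":
    speeds = [100.0, 200.0, 300.0]  # hypothetical samples/second per worker
    equal = [200, 200, 200]         # naive equal split of 600 samples
    proportional = proportional_partition(600, speeds)
    print("equal split step time:       ", sync_step_time(equal, speeds))
    print("proportional split step time:", sync_step_time(proportional, speeds))
```

With an equal split, the step time is set by the slowest worker. With a proportional split, all workers finish at roughly the same time. To make the assignment dynamic, as the abstract proposes, one option would be to re-measure throughput periodically (for example, once per epoch) and call `proportional_partition` again; that schedule is also an assumption, not a detail taken from the paper.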
Taxonomy
Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems
