Distributed Optimization using Heterogeneous Compute Systems

Vineeth S

arXiv:2110.08941·cs.LG·October 19, 2021

TL;DR

This paper addresses the challenge of training deep neural networks on heterogeneous distributed systems by dynamically adjusting data partitions based on each worker's compute power, leading to improved efficiency and reduced training time.

Contribution

The paper introduces a method that dynamically partitions the training data among workers with varying compute capabilities, so that distributed training is not held back by the slowest worker.
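One way to picture the "dynamic" part is a loop that re-estimates each worker's speed from measured timings and recomputes its share of the data. The sketch below is only an illustration under stated assumptions: the function names `update_throughput` and `partition_fractions`, the exponential moving average, and the `smoothing` parameter are all hypothetical and are not taken from the paper or its repository.

```python
from typing import List


def update_throughput(
    prev: List[float],
    samples_done: List[int],
    seconds: List[float],
    smoothing: float = 0.5,
) -> List[float]:
    """Blend the newly measured samples/second with the previous estimate.

    Assumption: an exponential moving average is used to damp noise;
    the paper's actual update rule is not given in this summary.
    """
    measured = [n / t for n, t in zip(samples_done, seconds)]
    return [smoothing * m + (1.0 - smoothing) * p for m, p in zip(measured, prev)]


def partition_fractions(throughput: List[float]) -> List[float]:
    """Fraction of the data each worker should receive, proportional to speed."""
    total = sum(throughput)
    return [t / total for t in throughput]


# Start from an equal guess, then observe one round of timings:
# both workers processed 500 samples, but worker 0 needed twice as long.
throughput = [100.0, 100.0]
throughput = update_throughput(throughput, [500, 500], [10.0, 5.0])
fractions = partition_fractions(throughput)
```

After this single round the slower worker's estimated throughput drops, so its fraction of the next round's data is smaller than the faster worker's. Repeating the update each round lets the shares track changes in worker speed.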

Findings

1. Dynamic data partitioning improves system utilization.
2. Training time is significantly reduced.
3. The method is effective in heterogeneous hardware environments.

Abstract

Hardware compute power has been growing at an unprecedented rate in recent years. Making use of these advances plays a key role in producing better results in less time -- both in academia and industry. However, merging existing hardware with the latest hardware within the same ecosystem is a challenging task. One of the key challenges in this case is varying compute power. In this paper, we consider the training of deep neural networks on a distributed system of workers with varying compute power. A naive implementation of synchronous distributed training will result in the faster workers waiting for the slowest worker to complete processing. To mitigate this issue, we propose to dynamically adjust the data assigned to each worker during training. We assign each worker a partition of the total data proportional to its computing power. Our experiments show that…
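The straggler effect and the proportional split described in the abstract can be illustrated with a small sketch. It assumes each worker's compute power is already known as a relative number (samples per unit time) and that a synchronous step lasts as long as the slowest worker. The helper names `proportional_partition` and `sync_step_time` and the example values are hypothetical and do not come from the paper's code.

```python
from typing import List


def proportional_partition(num_samples: int, compute_power: List[float]) -> List[int]:
    """Split num_samples into per-worker counts proportional to compute_power."""
    if not compute_power or any(p <= 0 for p in compute_power):
        raise ValueError("compute_power must be non-empty and positive")
    total = sum(compute_power)
    exact = [num_samples * p / total for p in compute_power]
    counts = [int(x) for x in exact]  # round down first
    # Give the samples lost to rounding to the largest fractional remainders.
    leftover = num_samples - sum(counts)
    order = sorted(range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def sync_step_time(counts: List[int], compute_power: List[float]) -> float:
    """A synchronous step finishes only when the slowest worker finishes."""
    return max(c / p for c, p in zip(counts, compute_power))


# Hypothetical cluster: relative speeds 1x, 2x and 5x; 1200 samples per step.
power = [1.0, 2.0, 5.0]
equal_split = [400, 400, 400]
prop_split = proportional_partition(1200, power)

equal_time = sync_step_time(equal_split, power)
prop_time = sync_step_time(prop_split, power)
```

In this toy setting the equal split is bounded by the 1x worker, which needs 400 time units, while the proportional split `[150, 300, 750]` lets all three workers finish together in 150 time units. Real speedups depend on communication cost and on how accurately compute power is estimated, which this sketch ignores.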

Code & Models

Repositories

vineeths96/heterogeneous-systems (PyTorch, official)

Taxonomy

Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems