HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments

Ji Liu; Zhihua Wu; Dianhai Yu; Yanjun Ma; Danlei Feng; Minxu Zhang; Xinxuan Wu; Xuefeng Yao; Dejing Dou

arXiv:2111.10635·cs.DC·June 8, 2023·6 cites

TL;DR

This paper introduces Paddle-HeterPS, a distributed deep learning framework that uses reinforcement learning to efficiently schedule layers across heterogeneous resources, significantly improving training throughput and reducing costs.

Contribution

The paper presents a novel RL-based scheduling method within a distributed framework for heterogeneous environments, enhancing training efficiency and cost-effectiveness.

Findings

01

Reports throughput up to 14.5 times higher than state-of-the-art approaches.

02

Reports monetary cost 312.3% smaller than that of state-of-the-art approaches (the paper's phrasing).

03

Manages data storage and data communication among the distributed computing resources.

Abstract

Deep neural networks (DNNs) exploit many layers and a large number of parameters to achieve excellent performance. The training process of DNN models generally handles large-scale input data with many sparse features, which incurs high Input/Output (IO) cost, while some layers are compute-intensive. The training process generally exploits distributed computing resources to reduce training time. In addition, heterogeneous computing resources, e.g., CPUs, GPUs of multiple types, are available for the distributed training process. Thus, the scheduling of multiple layers to diverse computing resources is critical for the training process. To efficiently train a DNN model using the heterogeneous computing resources, we propose a distributed framework, i.e., Paddle-Heterogeneous Parameter Server (Paddle-HeterPS), composed of a distributed architecture and a Reinforcement Learning (RL)-based…
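To make the scheduling problem concrete, here is a minimal sketch of assigning layers to resource types so that monetary cost is minimized under a time limit. It uses exhaustive search, not the paper's RL-based method, and every number and name in it (`LAYER_TIME`, `PRICE`, `TRANSFER_TIME`, `evaluate`, `best_plan`) is a hypothetical chosen for illustration, not taken from Paddle-HeterPS.

```python
from itertools import product

# Hypothetical per-batch execution time (seconds) of each layer on each resource type.
LAYER_TIME = [
    {"cpu": 1.0, "gpu": 2.0},  # sparse, IO-heavy layer: assumed faster on CPU
    {"cpu": 4.0, "gpu": 0.5},  # compute-intensive layer: assumed faster on GPU
    {"cpu": 4.0, "gpu": 0.5},  # compute-intensive layer
]
PRICE = {"cpu": 1.0, "gpu": 5.0}  # hypothetical price per second of use
TRANSFER_TIME = 0.2  # hypothetical cost of moving data between resource types


def evaluate(plan):
    """Return (total_time, monetary_cost) for a tuple of per-layer resource types."""
    time = sum(LAYER_TIME[i][dev] for i, dev in enumerate(plan))
    # Each switch of resource type between consecutive layers adds a transfer delay.
    time += TRANSFER_TIME * sum(a != b for a, b in zip(plan, plan[1:]))
    cost = sum(LAYER_TIME[i][dev] * PRICE[dev] for i, dev in enumerate(plan))
    return time, cost


def best_plan(time_limit):
    """Try every assignment; return the cheapest (cost, time, plan) within the limit."""
    feasible = []
    for plan in product(PRICE, repeat=len(LAYER_TIME)):
        time, cost = evaluate(plan)
        if time <= time_limit:
            feasible.append((cost, time, plan))
    return min(feasible) if feasible else None


if __name__ == "__main__":
    print(best_plan(time_limit=3.0))
```

With these made-up numbers, the cheapest feasible plan keeps the IO-heavy layer on the CPU and places the two compute-intensive layers on the GPU. Exhaustive search grows exponentially with the number of layers, which is one reason a learned scheduler is attractive for deep models.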

Code & Models

Repositories

PaddlePaddle/Paddle
paddle (Official)

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data