SRL: Scaling Distributed Reinforcement Learning to Over Ten Thousand Cores

Zhiyu Mei; Wei Fu; Jiaxuan Gao; Guangju Wang; Huanchen Zhang; Yi Wu

arXiv:2306.16688 · cs.DC · June 24, 2024

TL;DR

This paper introduces SRL, a scalable distributed reinforcement learning system that reaches up to 21x higher training throughput than existing libraries and supports large-scale experiments on over 15,000 CPU cores.

Contribution

The paper presents a novel abstraction for RL dataflows and, built on it, a scalable system called ReaLly Scalable RL (SRL) that supports efficient large-scale RL training and the development of customized algorithms.

Findings

01. Up to 21x higher training throughput compared to existing libraries.

02. Successfully scaled RL experiments to over 15,000 CPU cores.

03. Achieved up to 5x speedup in wall-clock time on complex environments.

Abstract

The ever-growing complexity of reinforcement learning (RL) tasks demands a distributed system to efficiently generate and process a massive amount of data. However, existing open-source libraries suffer from various limitations, which impede their practical use in challenging scenarios where large-scale training is necessary. In this paper, we present a novel abstraction on the dataflows of RL training, which unifies diverse RL training applications into a general framework. Following this abstraction, we develop a scalable, efficient, and extensible distributed RL system called ReaLly Scalable RL (SRL), which allows efficient and massively parallelized training and easy development of customized algorithms. Our evaluation shows that SRL outperforms existing academic libraries, reaching up to 21x higher training throughput in a distributed setting. On learning performance, beyond performing…

Code & Models

Videos

SRL: Scaling Distributed Reinforcement Learning to Over Ten Thousand Cores (SlidesLive)

Taxonomy

Topics: Reinforcement Learning in Robotics · Software Engineering Research · Evolutionary Algorithms and Applications