SRL: Scaling Distributed Reinforcement Learning to Over Ten Thousand Cores
Zhiyu Mei, Wei Fu, Jiaxuan Gao, Guangju Wang, Huanchen Zhang, Yi Wu

TL;DR
This paper introduces SRL, a scalable distributed reinforcement learning system that significantly improves training throughput and enables large-scale experiments with over 15,000 CPU cores, addressing limitations of existing libraries.
Contribution
The paper presents a novel abstraction for RL dataflows and, built on it, a scalable distributed system called ReaLlyScalableRL (SRL) that supports efficient large-scale RL training and customized algorithm development.
Findings
Up to 21x higher training throughput than existing academic libraries in a distributed setting.
Successfully scaled RL experiments to over 15,000 CPU cores.
Achieved up to 5x speedup in wall-clock time on complex environments.
Abstract
The ever-growing complexity of reinforcement learning (RL) tasks demands a distributed system to efficiently generate and process a massive amount of data. However, existing open-source libraries suffer from various limitations, which impede their practical use in challenging scenarios where large-scale training is necessary. In this paper, we present a novel abstraction on the dataflows of RL training, which unifies diverse RL training applications into a general framework. Following this abstraction, we develop a scalable, efficient, and extensible distributed RL system called ReaLlyScalableRL (SRL), which allows efficient and massively parallelized training and easy development of customized algorithms. Our evaluation shows that SRL outperforms existing academic libraries, reaching up to 21x higher training throughput in a distributed setting. On learning performance, beyond performing…
Taxonomy
Topics: Reinforcement Learning in Robotics · Software Engineering Research · Evolutionary Algorithms and Applications
