MindSpeed RL: Distributed Dataflow for Scalable and Efficient RL Training on Ascend NPU Cluster

Laingjun Feng; Chenyi Pan; Xinjie Guo; Fei Mei; Benzhe Ning; Jianxiang Zhang; Xinyang Liu; Beirong Zhou; Zeng Shu; Chang Liu; Guang Yang; Zhenyu Han; Jiangben Wang; Bo Wang

arXiv:2507.19017 · cs.LG · July 28, 2025

TL;DR

MindSpeed RL introduces a distributed dataflow system for scalable and efficient reinforcement learning training on Ascend NPU clusters, significantly improving throughput and resource utilization.

Contribution

The paper presents a distributed dataflow architecture for RL training that reduces dispatch overhead and redundant memory usage, enabling large-scale training on NPU clusters.

Findings

1. Achieved a 1.42× to 3.97× throughput improvement over state-of-the-art systems.
2. Reduced dispatch overhead and redundant memory in RL training.
3. Demonstrated strong performance on large models trained on Ascend NPUs.

Abstract

Reinforcement learning (RL) is a paradigm increasingly used to align large language models. Popular RL algorithms use multiple workers and can be modeled as a graph, where each node is the state of a worker and each edge represents dataflow between nodes. Owing to heavy cross-node dependencies, RL training systems usually suffer from poor cluster scalability and low memory utilization. In this article, we introduce MindSpeed RL, an effective and efficient system for large-scale RL training. Unlike existing centralized methods, MindSpeed RL organizes the essential data dependencies in RL training, i.e., sample flow and resharding flow, from a distributed view. On the one hand, a distributed transfer dock strategy, which builds controllers and warehouses on top of the conventional replay buffer, is designed to relieve the dispatch overhead in the sample flow. A practical…
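To make the transfer-dock idea more concrete, here is a minimal Python sketch under stated assumptions. It is not MindSpeed RL's code or API: the class names `Warehouse`, `Controller`, and `TransferDock`, the modulo sharding of sample indices, and the per-consumer "claim" rule are all hypothetical. It only illustrates the split described above: warehouses hold sample payloads, while lightweight controllers track which fields of which samples are ready, so a consuming worker asks controllers for ready indices and reads payloads from the matching warehouse instead of routing everything through one central buffer.

```python
"""Hypothetical sketch of a sharded "transfer dock" for RL sample flow.

Assumption: this is NOT MindSpeed RL's API. Warehouse, Controller and
TransferDock are illustrative names only.
"""
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set


@dataclass
class Warehouse:
    """Stores the actual sample payloads (column -> value) for one shard."""
    data: Dict[int, Dict[str, Any]] = field(default_factory=dict)

    def put(self, idx: int, column: str, value: Any) -> None:
        self.data.setdefault(idx, {})[column] = value

    def get(self, idx: int, columns: List[str]) -> Dict[str, Any]:
        return {c: self.data[idx][c] for c in columns}


@dataclass
class Controller:
    """Tracks metadata only: ready columns per index, and indices each consumer took."""
    ready: Dict[int, Set[str]] = field(default_factory=dict)
    consumed: Dict[str, Set[int]] = field(default_factory=dict)

    def mark_ready(self, idx: int, column: str) -> None:
        self.ready.setdefault(idx, set()).add(column)

    def claim(self, consumer: str, needs: List[str], n: int) -> List[int]:
        """Return up to n indices with all `needs` ready that `consumer` has not claimed."""
        taken = self.consumed.setdefault(consumer, set())
        picked = [i for i in sorted(self.ready)
                  if i not in taken and set(needs) <= self.ready[i]][:n]
        taken.update(picked)
        return picked


class TransferDock:
    """Routes each sample index to one (controller, warehouse) shard by modulo."""

    def __init__(self, num_shards: int) -> None:
        self.controllers = [Controller() for _ in range(num_shards)]
        self.warehouses = [Warehouse() for _ in range(num_shards)]

    def _shard(self, idx: int) -> int:
        return idx % len(self.warehouses)

    def put(self, idx: int, column: str, value: Any) -> None:
        s = self._shard(idx)
        self.warehouses[s].put(idx, column, value)   # payload goes to the warehouse
        self.controllers[s].mark_ready(idx, column)  # only metadata goes to the controller

    def get(self, consumer: str, needs: List[str], n: int) -> List[Dict[str, Any]]:
        """Collect up to n ready samples across shards for one consumer stage."""
        out: List[Dict[str, Any]] = []
        for s, ctrl in enumerate(self.controllers):
            for idx in ctrl.claim(consumer, needs, n - len(out)):
                out.append({"idx": idx, **self.warehouses[s].get(idx, needs)})
            if len(out) >= n:
                break
        return out


if __name__ == "__main__":
    dock = TransferDock(num_shards=2)
    for i in range(4):
        dock.put(i, "prompt", f"p{i}")    # e.g. written by a data loader
    for i in (0, 1, 2):
        dock.put(i, "response", f"r{i}")  # e.g. written by a rollout worker
    # A hypothetical reward worker pulls only samples whose responses exist.
    print(dock.get("reward_worker", ["prompt", "response"], n=3))
```

The design choice being illustrated is that the controllers exchange only small index/readiness records, so the cost of deciding "which samples are ready for which stage" is spread over several shards rather than concentrated in one dispatcher. How MindSpeed RL actually places controllers and warehouses across nodes is described in the paper, not in this sketch.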

Taxonomy

Topics: Advanced Neural Network Applications · Reinforcement Learning in Robotics · IoT and Edge/Fog Computing