TL;DR
This paper introduces a periodically asynchronous RL training framework for LLMs that deploys inference and training separately and synchronizes model weights once per training iteration. It roughly doubles throughput on NPU platforms (up to 3x on GPU platforms) while remaining on-policy.
Contribution
It proposes a periodically asynchronous, on-policy RL training framework that runs rollout generation and training as a producer-consumer pipeline, improving throughput without requiring any modification to standard RL algorithms.
Findings
Approximately 2x throughput improvement on NPU platforms
Up to 3x speedup on GPU platforms
Maintains on-policy correctness without off-policy bias
Abstract
Since the introduction of the GRPO algorithm, reinforcement learning (RL) has attracted increasing attention for LLM post-training, yet training efficiency remains a critical challenge. In mainstream RL frameworks, inference and training are co-located on the same devices, and their synchronous execution prevents concurrent inference and training. In this work, we revisit the strategy of separating inference and training deployment, and propose a periodically asynchronous framework that transforms synchronous RL training into an asynchronous producer-consumer pipeline. By synchronising model weights at the beginning of each training iteration and generating all rollouts from the same policy, the proposed framework remains inherently on-policy -- without any modification to standard RL algorithms -- thereby avoiding the off-policy bias introduced by existing asynchronous approaches. We…
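The producer-consumer structure described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the function `run_periodic_async`, the integer `policy_version` standing in for model weights, and the choice to apply a single update at the end of each iteration are all assumptions made for illustration. The sketch only shows why every rollout consumed in an iteration comes from the same policy the trainer currently holds.

```python
import queue
import threading


def run_periodic_async(num_iterations, rollouts_per_iter):
    """Toy periodically asynchronous loop (hypothetical, for illustration only).

    Weights are synchronised once at the start of each iteration; within the
    iteration, generation (producer) and training (consumer) overlap.
    """
    policy_version = 0  # stand-in for the trainer's model weights
    log = []            # (iteration, rollout_version, trainer_version)

    for it in range(num_iterations):
        # Periodic synchronisation: inference receives the current weights.
        inference_version = policy_version
        buf = queue.Queue()

        def producer(version=inference_version):
            # Generate all rollouts of this iteration from the same policy.
            for i in range(rollouts_per_iter):
                buf.put({"prompt_id": i, "policy_version": version})
            buf.put(None)  # sentinel: no more rollouts in this iteration

        worker = threading.Thread(target=producer)
        worker.start()

        # Consumer: process rollouts as they arrive. Weights are NOT changed
        # here (assumption: gradients are accumulated until the iteration ends).
        while True:
            item = buf.get()
            if item is None:
                break
            log.append((it, item["policy_version"], policy_version))
        worker.join()

        # Assumed single update per iteration, then the next sync picks it up.
        policy_version += 1

    return policy_version, log


if __name__ == "__main__":
    final_version, log = run_periodic_async(num_iterations=3, rollouts_per_iter=4)
    # On-policy check: each rollout's policy version matches the trainer's.
    print(all(rollout_v == trainer_v for _, rollout_v, trainer_v in log))
```

In this sketch the trainer never sees a rollout produced by stale weights, because the weights change only between iterations, after the producer has finished. A fully asynchronous design would let the producer keep generating across updates, which is where off-policy samples would come from.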
