CoPRIS: Efficient and Stable Reinforcement Learning via Concurrency-Controlled Partial Rollout with Importance Sampling
Zekai Qu, Yinxu Pan, Ao Sun, Chaojun Xiao, Xu Han

TL;DR
CoPRIS introduces a concurrency-controlled partial rollout method with importance sampling for reinforcement learning in large language models, significantly improving training efficiency while maintaining performance.
Contribution
It proposes a novel asynchronous RL framework with importance sampling correction, reducing training time and GPU idle time in LLM reinforcement learning.
Findings
Achieves up to 1.94x faster training
Maintains comparable or better performance
Effective in mathematical reasoning benchmarks
Abstract
Reinforcement learning (RL) post-training has become a trending paradigm for enhancing the capabilities of large language models (LLMs). Most existing RL systems for LLMs operate in a fully synchronous manner, where training must wait for the rollout of an entire batch to complete. This design leads to severe inefficiencies, as extremely long trajectories can stall the entire rollout process and leave many GPUs idle. To address this issue, we propose Concurrency- Controlled Partial Rollout with Importance Sampling (CoPRIS), which mitigates long-tail inefficiencies by maintaining a fixed number of concurrent rollouts, early-terminating once sufficient samples are collected, and reusing unfinished trajectories in subsequent rollouts. To mitigate the impact of off-policy trajectories, we introduce Cross-stage Importance Sampling Correction, which concatenates buffered log probabilities…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications
