RLoop: A Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization
Zeng Zhiyuan, Jiashuo Liu, Zhangyue Yin, Ge Zhang, Wenhao Huang, Xipeng Qiu

TL;DR
RLoop is a self-improving reinforcement learning framework that iteratively refines policies through exploration and fine-tuning, significantly enhancing generalization and robustness over standard RL methods.
Contribution
The paper introduces an iterative policy initialization framework that mitigates overfitting and catastrophic forgetting in RL by reusing the diverse policies produced during training and filtering their successful trajectories into an expert dataset.
Findings
Boosts average accuracy by 9% over vanilla RL
Improves pass@32 metric by over 15%
Reduces policy over-specialization and forgetting
Abstract
While Reinforcement Learning for Verifiable Rewards (RLVR) is powerful for training large reasoning models, its training dynamics harbor a critical challenge: RL overfitting, where models gain training rewards but lose generalization. Our analysis reveals this is driven by policy over-specialization and catastrophic forgetting of diverse solutions generated during training. Standard optimization discards this valuable inter-step policy diversity. To address this, we introduce RLoop, a self-improving framework built on iterative policy initialization. RLoop transforms the standard training process into a virtuous cycle: it first uses RL to explore the solution space from a given policy, then filters the successful trajectories to create an expert dataset. This dataset is used via Rejection-sampling Fine-Tuning (RFT) to refine the initial policy, creating a superior starting point for the…
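The cycle described in the abstract (explore with RL, keep the successful trajectories, refine the initial policy with RFT, repeat) can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the policy is a plain probability vector over candidate answers, and `rl_explore`, `filter_successes`, `rft_update`, and `rloop` are hypothetical names introduced here, with simple stand-ins for the actual RL and fine-tuning steps.

```python
import random
from typing import Callable, List, Tuple

Policy = List[float]             # toy policy: probability of each candidate answer
Trajectory = Tuple[int, float]   # (answer index, reward)


def rl_explore(policy: Policy, reward_fn: Callable[[int], float], steps: int,
               samples_per_step: int, rng: random.Random,
               lr: float = 0.5) -> List[Trajectory]:
    """Toy stand-in for the RL phase (assumption, not a real RL algorithm).

    Samples answers, nudges a working copy of the policy toward rewarded
    answers, and keeps the trajectories from every intermediate step.
    """
    current = list(policy)
    collected: List[Trajectory] = []
    for _ in range(steps):
        for _ in range(samples_per_step):
            answer = rng.choices(range(len(current)), weights=current)[0]
            reward = reward_fn(answer)
            collected.append((answer, reward))
            current[answer] += lr * reward
        total = sum(current)
        current = [p / total for p in current]
    return collected


def filter_successes(trajectories: List[Trajectory],
                     threshold: float = 1.0) -> List[Trajectory]:
    """Keep only trajectories whose verifiable reward meets the threshold."""
    return [t for t in trajectories if t[1] >= threshold]


def rft_update(policy: Policy, expert: List[Trajectory],
               weight: float = 0.5) -> Policy:
    """Toy stand-in for RFT: mix the initial policy with the empirical
    distribution of answers in the expert dataset."""
    if not expert:
        return list(policy)
    counts = [0] * len(policy)
    for answer, _ in expert:
        counts[answer] += 1
    n = len(expert)
    return [(1 - weight) * p + weight * c / n for p, c in zip(policy, counts)]


def rloop(policy: Policy, reward_fn: Callable[[int], float],
          iterations: int = 3, steps: int = 4, samples_per_step: int = 8,
          seed: int = 0) -> Policy:
    """Outer loop: each iteration restarts RL from the RFT-refined policy."""
    rng = random.Random(seed)
    for _ in range(iterations):
        trajectories = rl_explore(policy, reward_fn, steps, samples_per_step, rng)
        expert = filter_successes(trajectories)
        policy = rft_update(policy, expert)  # refined starting point for the next round
    return policy


if __name__ == "__main__":
    # Hypothetical task: only answer index 2 is verified as correct.
    final = rloop([0.25] * 4, lambda a: 1.0 if a == 2 else 0.0)
    print(final)
```

Note that in this sketch the policy drifted by the RL phase is discarded after each round; only its trajectories survive, through the expert dataset, which mirrors the abstract's point that the intermediate policies' diversity is kept rather than thrown away.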
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
