RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization

Zeng Zhiyuan; Jiashuo Liu; Zhangyue Yin; Ge Zhang; Wenhao Huang; Xipeng Qiu

arXiv:2511.04285·cs.AI·November 7, 2025

TL;DR

RLoop is a self-improving reinforcement learning framework that alternates RL exploration with rejection-sampling fine-tuning (RFT) on filtered successful trajectories, improving generalization over vanilla RL.

Contribution

It introduces a novel iterative policy initialization framework that mitigates overfitting and catastrophic forgetting in RL by leveraging policy diversity and expert dataset filtering.

Findings

01. Boosts average accuracy by 9% over vanilla RL
02. Improves pass@32 metric by over 15%
03. Reduces policy over-specialization and forgetting

Abstract

While Reinforcement Learning for Verifiable Rewards (RLVR) is powerful for training large reasoning models, its training dynamics harbor a critical challenge: RL overfitting, where models gain training rewards but lose generalization. Our analysis reveals this is driven by policy over-specialization and catastrophic forgetting of diverse solutions generated during training. Standard optimization discards this valuable inter-step policy diversity. To address this, we introduce RLoop, a self-improving framework built on iterative policy initialization. RLoop transforms the standard training process into a virtuous cycle: it first uses RL to explore the solution space from a given policy, then filters the successful trajectories to create an expert dataset. This dataset is used via Rejection-sampling Fine-Tuning (RFT) to refine the initial policy, creating a superior starting point for the…
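The explore, filter, and refine cycle described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the policy here is just a hypothetical table of per-prompt success probabilities, and `toy_rl_explore`, `toy_verifier`, `toy_rft`, and `rloop_sketch` are illustrative stand-ins invented for this example. The one structural point taken from the abstract is that RFT refines the policy the iteration started from, using trajectories gathered during exploration.

```python
import random
from typing import Dict, List, Tuple

Policy = Dict[str, float]      # toy policy (assumption): prompt -> P(correct answer)
Trajectory = Tuple[str, str]   # (prompt, response)


def toy_rl_explore(policy: Policy, rng: random.Random, samples: int = 8) -> List[Trajectory]:
    """Stand-in for the RL phase: sample several responses per prompt."""
    return [
        (prompt, "correct" if rng.random() < p else "wrong")
        for prompt, p in policy.items()
        for _ in range(samples)
    ]


def toy_verifier(prompt: str, response: str) -> bool:
    """Stand-in for a verifiable reward: accept only correct responses."""
    return response == "correct"


def toy_rft(policy: Policy, expert: List[Trajectory], lr: float = 0.5) -> Policy:
    """Stand-in for RFT: raise success probability on prompts with expert data."""
    solved = {prompt for prompt, _ in expert}
    return {
        q: (p + lr * (1.0 - p)) if q in solved else p
        for q, p in policy.items()
    }


def rloop_sketch(policy: Policy, iterations: int, seed: int = 0) -> Policy:
    """Explore from the current policy, filter successes, refine that same policy."""
    rng = random.Random(seed)
    for _ in range(iterations):
        trajectories = toy_rl_explore(policy, rng)                 # explore
        expert = [t for t in trajectories if toy_verifier(*t)]     # filter
        policy = toy_rft(policy, expert)                           # refine the starting policy
    return policy


initial = {"q1": 0.5, "q2": 0.2}
final = rloop_sketch(initial, iterations=3)
```

In this toy version the refined policy becomes the starting point of the next iteration, so per-prompt success probabilities can only stay the same or rise. A real system would replace the three stand-ins with an RL trainer, a task-specific verifier, and supervised fine-tuning on the filtered trajectories.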

Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications