Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model

Jing Liang; Hongyao Tang; Yi Ma; Jinyi Liu; Yan Zheng; Shuyue Hu; Lei Bai; Jianye Hao

arXiv:2507.06892·cs.LG·July 14, 2025

TL;DR

This paper introduces ReMix, an off-policy reinforcement finetuning method for large language models that significantly reduces training costs while maintaining state-of-the-art performance on math reasoning benchmarks.

Contribution

ReMix enables off-policy RL in LLM finetuning, improving efficiency and performance over existing on-policy methods like PPO and GRPO.

Findings

01 ReMix achieves SOTA performance while using 30x to 450x less training data.

02 Off-policy discrepancy influences response length preferences.

03 Severe off-policyness can cause collapse in self-reflection behavior.

Abstract

Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs). One major limitation of most existing Reinforcement Finetuning (RFT) methods is that they are on-policy RL in nature, i.e., data generated during the past learning process is not fully utilized. This inevitably comes at a significant cost of compute and time, posing a stringent bottleneck on continuing economic and efficient scaling. To this end, we launch the renaissance of off-policy RL and propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix), a general approach to enable on-policy RFT methods like PPO and GRPO to leverage off-policy data. ReMix consists of three major components: (1) Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio for efficient training; (2) KL-Convex policy constraint to balance the trade-off…
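To make the idea of reusing past data concrete, the following is a minimal sketch of a generic PPO-style clipped surrogate evaluated on a batch that mixes fresh and replayed samples. It is not the ReMix implementation: the function names `clipped_surrogate` and `mix_batch`, the `off_fraction` parameter, and the default `clip_eps=0.2` are all assumptions made for illustration. The one point it shows is that each sample's importance ratio is taken against the policy that actually generated it, which is what lets older rollouts enter the same objective.

```python
import math


def clipped_surrogate(logp_new, logp_behavior, advantages, clip_eps=0.2):
    """Mean PPO-style clipped surrogate objective over a batch.

    logp_new:      log-probabilities under the policy being updated
    logp_behavior: log-probabilities under whichever policy generated each
                   sample (the current one for fresh rollouts, an older one
                   for replayed data)
    advantages:    per-sample advantage estimates
    clip_eps:      clipping range (hypothetical default)
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_behavior, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance ratio
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


def mix_batch(on_policy, off_policy, off_fraction, batch_size):
    """Build a batch with a fixed share of replayed (off-policy) samples.

    off_fraction is a hypothetical knob, not a parameter named in the paper.
    """
    n_off = min(int(off_fraction * batch_size), len(off_policy))
    n_on = min(batch_size - n_off, len(on_policy))
    return on_policy[:n_on] + off_policy[:n_off]
```

With identical log-probabilities the ratio is 1 and the objective reduces to the mean advantage. When the updated policy has drifted far from the behavior policy, the clip caps how much a positive-advantage sample can contribute, which is one simple way to limit the effect of stale data.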
