TL;DR
This paper introduces ReMix, an off-policy reinforcement finetuning method for large language models that significantly reduces training costs while maintaining state-of-the-art performance on math reasoning benchmarks.
Contribution
ReMix enables off-policy RL in LLM finetuning, improving efficiency and performance over existing on-policy methods like PPO and GRPO.
Findings
ReMix achieves SOTA performance with a 30x to 450x reduction in training data.
Off-policy discrepancy influences response length preferences.
Severe off-policyness can cause collapse in self-reflection behavior.
Abstract
Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs). One major limitation of most existing Reinforcement Finetuning (RFT) methods is that they are on-policy RL in nature, i.e., data generated during the past learning process is not fully utilized. This inevitably comes at a significant cost of compute and time, posing a stringent bottleneck on continuing economic and efficient scaling. To this end, we launch the renaissance of off-policy RL and propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix), a general approach to enable on-policy RFT methods like PPO and GRPO to leverage off-policy data. ReMix consists of three major components: (1) Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio for efficient training; (2) KL-Convex policy constraint to balance the trade-off…
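To make the on-policy versus off-policy distinction in the abstract concrete, here is a minimal sketch of a standard PPO-style clipped surrogate evaluated on a batch that mixes fresh and replayed samples. It is not the ReMix objective, whose details the truncated abstract does not give. The function name `clipped_surrogate`, the `clip_eps` default, and the toy numbers are assumptions made for illustration only.

```python
import math


def clipped_surrogate(logp_new, logp_behavior, advantages, clip_eps=0.2):
    """Mean PPO-style clipped surrogate over a batch (illustrative, not ReMix).

    logp_behavior holds each sample's log-probability under the policy that
    generated it: the current policy for fresh rollouts (ratio close to 1),
    or an older policy for replayed, off-policy samples.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_behavior, advantages):
        # Importance ratio between the current policy and the behavior policy.
        ratio = math.exp(lp_new - lp_old)
        # Clip the ratio so samples far from the current policy cannot
        # dominate the update.
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Hypothetical mixed batch: the first sample is on-policy (ratio 1),
# the second is off-policy (ratio 2) and gets clipped to 1 + clip_eps.
mixed_value = clipped_surrogate(
    logp_new=[0.0, math.log(2.0)],
    logp_behavior=[0.0, 0.0],
    advantages=[1.0, 1.0],
)
```

In this toy batch the off-policy sample contributes only its clipped value, which is the basic mechanism that lets a proximal method reuse older data without taking arbitrarily large steps.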
