How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR
Richa Verma, Balaraman Ravindran

TL;DR
This paper introduces G2D, a three-stage pipeline that combines a short online RL warm-up with offline DPO fine-tuning and matches or outperforms online RL methods at lower computational cost.
Contribution
The authors propose G2D, an approach that reduces reliance on online RL by using a short warm-up phase to generate an informative offline dataset for preference optimization.
Findings
Offline DPO with moderate warm-up matches or outperforms online RL at lower compute.
Performance depends on data informativeness, not the number of preference pairs.
Moderate warm-up yields calibrated uncertainty and stronger contrastive signals.
Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for reasoning in language models, with GRPO as its primary example. However, GRPO requires continuous online rollout generation, making it computationally expensive and difficult to scale. While Direct Preference Optimization (DPO) offers a stable and efficient offline alternative, it is typically expected to underperform online RL methods such as GRPO when trained on rollouts from a cold supervised fine-tuned (SFT) policy. We introduce G2D (GRPO to DPO), a three-stage pipeline that performs a short GRPO warm-up, constructs a static preference dataset, and fine-tunes a model offline with DPO. Across a range of values for the number of online GRPO steps (K) on Qwen2.5-7B and Llama-3.1-8B, we find that offline DPO with moderate warm-up matches or outperforms GRPO at substantially lower compute…
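The second and third stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes binary verifiable rewards, pairs each correct rollout with an incorrect one for the same prompt, and scores a pair with the standard DPO loss. The names `build_preference_pairs`, `dpo_loss`, `max_pairs_per_prompt`, and the default `beta=0.1` are hypothetical choices made for illustration.

```python
import math
from itertools import product


def build_preference_pairs(rollouts, max_pairs_per_prompt=4):
    """Turn verifier-scored rollouts into a static preference dataset.

    rollouts: dict mapping prompt -> list of (completion, reward),
              with reward assumed to be 1 (verified correct) or 0.
    """
    pairs = []
    for prompt, samples in rollouts.items():
        chosen = [c for c, r in samples if r == 1]
        rejected = [c for c, r in samples if r == 0]
        # Prompts whose rollouts are all correct or all wrong give an
        # empty product, so they contribute no contrastive pairs.
        candidates = list(product(chosen, rejected))[:max_pairs_per_prompt]
        for winner, loser in candidates:
            pairs.append({"prompt": prompt, "chosen": winner, "rejected": loser})
    return pairs


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, from sequence log-probabilities."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), written to avoid overflow for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    demo = {
        "p1": [("a", 1), ("b", 0), ("c", 0)],  # mixed outcomes: yields pairs
        "p2": [("x", 1), ("y", 1)],            # all correct: no pairs
    }
    print(build_preference_pairs(demo))
    print(dpo_loss(-5.0, -9.0, -6.0, -8.0))
```

Under these assumptions, only prompts with mixed outcomes produce pairs, which is one simple way to see why the state of the policy that generates the rollouts affects how much contrastive signal the offline dataset carries.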
