How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR

Richa Verma; Balaraman Ravindran

arXiv:2605.21266·cs.LG·May 21, 2026

TL;DR

This paper introduces G2D (GRPO to DPO), a three-stage pipeline that runs a short online GRPO warm-up, builds a static preference dataset from the warmed-up policy, and fine-tunes offline with DPO. With a moderate warm-up, it matches or outperforms GRPO at substantially lower compute.

Contribution

The authors propose G2D, which reduces reliance on online RL by using a short GRPO warm-up phase to produce an informative offline dataset for preference optimization with DPO.

Findings

01

Offline DPO with moderate warm-up matches or outperforms online RL at lower compute.

02

Performance depends on data informativeness, not the number of preference pairs.

03

Moderate warm-up yields calibrated uncertainty and stronger contrastive signals.

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for reasoning in language models, with GRPO as its primary example. However, GRPO requires continuous online rollout generation, making it computationally expensive and difficult to scale. While Direct Preference Optimization (DPO) offers a stable and efficient offline alternative, it is typically expected to underperform online RL methods such as GRPO when trained on rollouts from a cold supervised fine-tuned (SFT) policy. We introduce G2D (GRPO to DPO), a three-stage pipeline that performs a short GRPO warm-up, constructs a static preference dataset, and fine-tunes a model offline with DPO. Across a range of values for the number of online GRPO steps (K) on Qwen2.5-7B and Llama-3.1-8B, we find that offline DPO with moderate warm-up matches or outperforms GRPO at substantially lower compute…
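
The second and third stages of the pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not say how preference pairs are constructed, so the pairing rule below (a verifier-approved rollout as "chosen", a verifier-rejected rollout for the same prompt as "rejected", binary rewards) is an assumption, as are the names build_preference_pairs, dpo_loss, and max_pairs_per_prompt. The loss is the standard per-pair DPO objective, not necessarily the exact variant used in the paper.

```python
import math
import random
from itertools import product


def build_preference_pairs(rollouts, max_pairs_per_prompt=4, seed=0):
    """Turn verifier-scored rollouts into a static preference dataset.

    rollouts: dict mapping prompt -> list of (response, reward) tuples,
    where reward is 1 if the verifier accepted the response, else 0.
    (Assumed format; the paper's construction may differ.)
    """
    rng = random.Random(seed)
    pairs = []
    for prompt, samples in rollouts.items():
        correct = [resp for resp, reward in samples if reward == 1]
        wrong = [resp for resp, reward in samples if reward == 0]
        # Prompts where every rollout has the same reward yield no
        # (correct, wrong) combination and therefore contribute no pairs.
        candidates = list(product(correct, wrong))
        rng.shuffle(candidates)
        for chosen, rejected in candidates[:max_pairs_per_prompt]:
            pairs.append(
                {"prompt": prompt, "chosen": chosen, "rejected": rejected}
            )
    return pairs


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(margin)).

    Each argument is the summed log-probability of a full response
    under the trained policy or the frozen reference policy.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals softplus(-m); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    demo_rollouts = {
        "q1": [("a", 1), ("b", 0), ("c", 0)],  # mixed outcomes: usable
        "q2": [("d", 1), ("e", 1)],            # all correct: no contrast
    }
    for pair in build_preference_pairs(demo_rollouts):
        print(pair)
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=1.0))
```

In this sketch, only prompts with mixed outcomes produce pairs, which is one simple way to see why the usefulness of a rollout set can depend on the policy that generated it rather than on the raw number of rollouts.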
