Bridging Offline and Online Reinforcement Learning for LLMs
Jack Lanchantin, Angelica Chen, Janice Lan, Xian Li, Swarnadeep Saha, Tianlu Wang, Jing Xu, Ping Yu, Weizhe Yuan, Jason E Weston, Sainbayar Sukhbaatar, Ilia Kulikov

TL;DR
This paper explores reinforcement learning techniques for fine-tuning large language models across offline, semi-online, and online regimes, demonstrating that online methods outperform offline ones and that multi-tasking enhances performance.
Contribution
The paper compares offline, semi-online, and online reinforcement learning methods for LLM fine-tuning, showing that semi-online and online variants perform similarly and that multi-tasking across reward types improves results.
Findings
Online and semi-online methods perform similarly and outperform offline methods.
Multi-tasking with verifiable and non-verifiable rewards improves overall performance.
Careful hyperparameter selection is needed to achieve the best results.
Abstract
We investigate the effectiveness of reinforcement learning methods for finetuning large language models when transitioning from offline to semi-online to fully online regimes for both verifiable and non-verifiable tasks. Our experiments cover training on verifiable math as well as non-verifiable instruction following with a set of benchmark evaluations for both. Across these settings, we extensively compare online and semi-online Direct Preference Optimization and Group Relative Policy Optimization objectives, and surprisingly find similar performance and convergence between these variants, which all strongly outperform offline methods. We provide a detailed analysis of the training dynamics and hyperparameter selection strategies to achieve optimal results. Finally, we show that multi-tasking with verifiable and non-verifiable rewards jointly yields improved performance across both task…
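As a rough illustration of the two objectives named in the abstract, the Python sketch below computes the standard per-pair DPO loss and the group-normalized advantages commonly used with GRPO. The function names, the `beta` default, and the `eps` constant are illustrative assumptions; the paper's exact formulations and hyperparameters may differ.

```python
import math
import statistics


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (beta is an assumed default).

    Inputs are sequence log-probabilities under the trained policy and a
    frozen reference policy. Returns -log(sigmoid(margin)).
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable form of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps); eps is an
    assumed constant that avoids division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

The DPO loss needs a chosen and a rejected response per prompt, whereas the group-relative advantage uses the scores of all sampled responses for a prompt.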
Peer Reviews
Decision: Submitted to ICLR 2026
The paper has an interesting premise: its main question is how far off-policy training can go before performance degrades. They find that training can be somewhat, but not fully, off-policy, which makes sense. They also evaluate the effectiveness of using verifiable and non-verifiable rewards.
The paper's premise is good, but I think it lacks a bit of substance. It reports a single model on three benchmarks with a single seed; I would like to see multiple seeds and another model. An LLM judge as the only evaluation method is, in my opinion, not enough; I would like to see a human study (even a small one) to evaluate the methods. As far as I can tell, the anchor for evaluation is always the base model; it would be nice to see Elo ratings showing how the models compare against each other.
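To make the reviewer's suggestion concrete, here is a minimal sketch of the standard Elo update applied to one pairwise comparison between two models. The function names and the `k=32` step size are assumptions chosen for illustration; nothing here is taken from the paper's evaluation setup.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Update both ratings after one comparison.

    score_a is 1.0 if A's response was preferred, 0.0 if B's was,
    and 0.5 for a tie. k is an assumed step size.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # The update is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two equally rated models; the judge prefers model A.
    print(elo_update(1000.0, 1000.0, 1.0))
```

Unlike win rates against a single fixed anchor, ratings of this kind are computed from comparisons between every pair of models being ranked.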
1. It is a timely and important investigation. The topic directly addresses a practical compute bottleneck of online GRPO and shows the trade-offs compared to other approaches. The strong performance gains of semi-online and online DPO show their viability. 2. It is original. It provides a thorough comparison of DPO and GRPO, two of the main RL methods for LLM post-training, and offers a new perspective on semi-online DPO as a viable alternative to GRPO. 3. Ablation experiments
1. The semi-online step sizes differ across verifiable (s = 10, 100) and non-verifiable (s = 5, 10) settings without clear justification. 2. The study relies on a single model family. It would be much more helpful to show the Qwen model family performance on the verifiable tasks. Right now there are only Qwen results on the non-verifiable evaluations, which makes it difficult to compare to other literature. 3. The claimed training efficiency of semi-online DPO remains qualitative. Quantitative
1. Offers a clear and unified view of offline, semi-online, and online RL using one parameter s; brings together results across verifiable and non-verifiable tasks. 2. Writing is clear and structured; methods and results are easy to follow despite some minor presentation issues. 3. Gives useful, practical insights for efficient LLM post-training; shows semi-online DPO can match online RL and that combining rewards improves generalization.
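The single-parameter view mentioned in this review can be sketched as follows, assuming `s` is the number of training steps between synchronizations of the response-generating model with the model being trained (so `s = 1` is fully online and never synchronizing is offline). The helper name and the use of `None` for the offline case are hypothetical choices for this illustration, not the paper's code.

```python
from typing import List, Optional


def generator_staleness(total_steps: int, s: Optional[int]) -> List[int]:
    """For each training step, report how many updates behind the generator is.

    s is the assumed synchronization interval: the generator is refreshed
    from the trained policy whenever step % s == 0. s=None means the
    generator is never refreshed after initialization (offline).
    """
    staleness = []
    last_sync = 0  # the generator starts as a copy of the initial policy
    for step in range(total_steps):
        if s is not None and step % s == 0:
            last_sync = step
        staleness.append(step - last_sync)
    return staleness


if __name__ == "__main__":
    print("online   (s=1):   ", generator_staleness(7, 1))
    print("semi     (s=3):   ", generator_staleness(7, 3))
    print("offline  (s=None):", generator_staleness(7, None))
```

Under this reading, larger values of `s` let generation run longer on a fixed snapshot of the model, at the cost of training on increasingly off-policy responses.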
**Contribution:** While the paper’s exploration of semi-online and online DPO is valuable, its conceptual novelty is limited. Prior works (e.g., Xu et al., 2023b; Xiong et al., 2023b; Chen et al., 2024b; Yuan et al., 2024; Qi et al., 2024; Guo et al., 2024) have already proposed iterative and fully online variants of DPO and demonstrated that they outperform offline methods. Therefore, the idea of bridging DPO across offline and online regimes is not fundamentally new. Nonetheless, this paper r
