Exploring the Potential of Offline RL for Reasoning in LLMs: A Preliminary Study
Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yiping Peng, Yunjie Ji, Han Zhao, Xiangang Li

TL;DR
This study explores the use of offline reinforcement learning methods, specifically DPO and LD-DPO, to improve reasoning in large language models, achieving notable performance gains with lower computational costs.
Contribution
It demonstrates that simple offline RL methods can effectively enhance reasoning in LLMs, providing empirical evidence and practical insights for cost-effective model training.
Findings
Offline RL methods improve reasoning performance by 3.3% on average.
DPO increases performance by 10.1% on the Arena-Hard benchmark.
Longer reasoning helps only when it is matched by greater semantic richness; otherwise performance can decline.
Abstract
Despite significant advances in long-context reasoning by large language models (LLMs), primarily through Online Reinforcement Learning (RL) methods, these approaches incur substantial computational costs and complexity. In contrast, simpler and more economical Offline RL methods remain underexplored. To address this gap, we investigate the effectiveness of Offline RL methods, specifically Direct Preference Optimization (DPO) and its length-desensitized variant LD-DPO, in enhancing the reasoning capabilities of LLMs. Extensive experiments across multiple reasoning benchmarks demonstrate that these simpler Offline RL methods substantially improve model performance, achieving an average enhancement of 3.3%, with a particularly notable increase of 10.1% on the challenging Arena-Hard benchmark. Furthermore, we analyze DPO's sensitivity to output length, emphasizing that increasing…
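For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-pair DPO loss as commonly formulated, not the authors' training code. The function names, the scalar log-probability inputs, and the default `beta=0.1` are illustrative assumptions; the length-desensitized LD-DPO variant is not shown.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are summed sequence log-probabilities of the chosen and
    rejected responses under the policy and under a frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) == softplus(-logits)
    return softplus(-logits)
```

When the policy equals the reference model, the loss is log 2; it falls as the policy assigns relatively more probability to the chosen response than to the rejected one. Because the inputs are summed over tokens, longer responses can dominate the margins, which is one way to see why output length matters for DPO.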
Taxonomy
Topics: Semantic Web and Ontologies · Business Process Modeling and Analysis · Software Engineering Research
Methods: ALIGN
