Exploring the Potential of Offline RL for Reasoning in LLMs: A Preliminary Study

Xiaoyu Tian; Sitong Zhao; Haotian Wang; Shuaiting Chen; Yiping Peng; Yunjie Ji; Han Zhao; Xiangang Li

arXiv:2505.02142·cs.CL·May 6, 2025

TL;DR

This study explores offline reinforcement learning methods, specifically DPO and its length-desensitized variant LD-DPO, for improving reasoning in large language models. It reports an average gain of 3.3% across reasoning benchmarks at lower computational cost than online RL.

Contribution

The paper demonstrates that simple offline RL methods can effectively enhance reasoning in LLMs, providing empirical evidence and practical insights for cost-effective model training.

Findings

01. Offline RL methods improve reasoning performance by 3.3% on average.

02. DPO increases performance by 10.1% on the Arena-Hard benchmark.

03. Increases in reasoning length should be matched by gains in semantic richness; otherwise performance can decline.

Abstract

Despite significant advances in long-context reasoning by large language models (LLMs), primarily through Online Reinforcement Learning (RL) methods, these approaches incur substantial computational costs and complexity. In contrast, simpler and more economical Offline RL methods remain underexplored. To address this gap, we investigate the effectiveness of Offline RL methods, specifically Direct Preference Optimization (DPO) and its length-desensitized variant LD-DPO, in enhancing the reasoning capabilities of LLMs. Extensive experiments across multiple reasoning benchmarks demonstrate that these simpler Offline RL methods substantially improve model performance, achieving an average enhancement of 3.3%, with a particularly notable increase of 10.1% on the challenging Arena-Hard benchmark. Furthermore, we analyze DPO's sensitivity to output length, emphasizing that increasing…
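For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-pair DPO objective, not the authors' training code. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions; the inputs are assumed to be summed token log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under the trained policy and a frozen reference model. LD-DPO's length desensitization is not modeled here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair (illustrative sketch).

    All four inputs are sequence-level log-probabilities (hypothetical
    values here); beta scales the implicit reward margin.
    """
    # Implicit rewards: log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)

    # Loss = -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy favors the chosen response
# more strongly than the reference does, so the loss falls below log(2).
example = dpo_loss(-40.0, -55.0, -42.0, -50.0)
```

Because the log-probabilities are summed over tokens, longer responses shift the margin more strongly, which is one way to see why output length matters for DPO.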

Code & Models

Datasets

ibndias/DeepSeek-Distilled-40M (dataset, 1.8k downloads)

Taxonomy

Topics: Semantic Web and Ontologies · Business Process Modeling and Analysis · Software Engineering Research

Methods: ALIGN