Effective Reinforcement Learning for Reasoning in Language Models

Lianghuan Huang; Shuo Li; Sagnik Anupam; Insup Lee; Osbert Bastani

arXiv:2505.17218·cs.AI·May 26, 2025

TL;DR

This paper analyzes reinforcement learning design choices for language model reasoning in relatively small models, finding that on-policy RL significantly outperforms supervised fine-tuning and that the proposed DASH algorithm improves training efficiency without sacrificing accuracy.

Contribution

The paper analyzes RL design choices for LM reasoning and introduces DASH, a novel algorithm that enhances training efficiency without sacrificing accuracy.

Findings

01

On-policy RL outperforms supervised fine-tuning.

02

PPO-based off-policy updates increase accuracy rather than reduce variance.

03

Removing the KL divergence term can lead to more concise generations and higher accuracy.

Abstract

Reinforcement learning (RL) has emerged as a promising strategy for improving the reasoning capabilities of language models (LMs) in domains such as mathematics and coding. However, most modern RL algorithms were designed to target robotics applications, which differ significantly from LM reasoning. We analyze RL algorithm design decisions for LM reasoning, for both accuracy and computational efficiency, focusing on relatively small models due to computational constraints. Our findings are: (i) on-policy RL significantly outperforms supervised fine-tuning (SFT), (ii) PPO-based off-policy updates increase accuracy rather than reducing variance, and (iii) removing KL divergence can lead to more concise generations and higher accuracy. Furthermore, we find that a key bottleneck to computational efficiency is that the optimal batch sizes for inference and backpropagation are different. We…
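For readers unfamiliar with the terms in findings (ii) and (iii), the following is a minimal sketch of the standard PPO clipped surrogate loss for a single token, with an optional KL penalty against a reference model. It is not taken from the paper and is not the DASH algorithm; the function name `ppo_token_loss`, the parameters `clip_eps` and `kl_coef`, and the naive log-ratio KL estimate are all assumptions chosen for illustration. Setting `kl_coef=0.0` corresponds to dropping the KL term.

```python
import math


def ppo_token_loss(logp_new, logp_old, advantage,
                   logp_ref=None, clip_eps=0.2, kl_coef=0.0):
    """Per-token PPO clipped loss with an optional KL penalty (illustrative only).

    logp_new: log-probability of the token under the policy being updated
    logp_old: log-probability under the policy that sampled the token
    logp_ref: log-probability under a frozen reference model (hypothetical)
    """
    # Importance ratio between the updated policy and the sampling policy.
    ratio = math.exp(logp_new - logp_old)
    # Clip the ratio to [1 - clip_eps, 1 + clip_eps].
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    # PPO takes the more pessimistic of the two surrogate values.
    surrogate = min(ratio * advantage, clipped * advantage)
    loss = -surrogate
    # Optional KL penalty, using a naive single-sample log-ratio estimate.
    if kl_coef > 0.0 and logp_ref is not None:
        loss += kl_coef * (logp_new - logp_ref)
    return loss


# Fully on-policy sample: the ratio is 1, so the loss is just -advantage.
on_policy = ppo_token_loss(-1.0, -1.0, 2.0)
# Off-policy sample with a large ratio: the clip caps the surrogate.
off_policy = ppo_token_loss(0.0, -1.0, 1.0)
# Same on-policy sample with a KL penalty toward a reference model.
with_kl = ppo_token_loss(-1.0, -1.0, 0.0, logp_ref=-2.0, kl_coef=0.1)
```

In this sketch the clip only matters once the updated policy has drifted from the sampling policy, which is the off-policy regime that finding (ii) refers to.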
