Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning

Chengqi Lyu; Songyang Gao; Yuzhe Gu; Wenwei Zhang; Jianfei Gao; Kuikun Liu; Ziyi Wang; Shuaibin Li; Qian Zhao; Haian Huang; Weihan Cao; Jiangning Liu; Hongwei Liu; Junnan Liu; Songyang Zhang; Dahua Lin; Kai Chen

arXiv:2502.06781·cs.CL·February 11, 2025

Open Access · 1 Repo · 6 Models · 1 Dataset

TL;DR

This paper introduces OREAL, a reinforcement learning framework using outcome rewards for mathematical reasoning, achieving state-of-the-art accuracy with significantly smaller models by leveraging binary feedback and token-level rewards.

Contribution

It proposes a novel RL approach with outcome rewards for math reasoning, including theoretical analysis and practical techniques that improve performance of smaller models.

Findings

01. A 7B model achieves 94.0% pass@1 on MATH-500.

02. The 32B model reaches 95.0% pass@1 on MATH-500, surpassing previous 32B models trained by distillation.

03. Token-level reward sampling enhances learning in sparse-reward environments.
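
The pass@1 figures above report how often a single sampled answer is correct. A minimal sketch of one common way to estimate it, assuming several answers are sampled per problem and the per-problem success rates are averaged; the function name and the toy outcomes are hypothetical and are not taken from the paper:

```python
def pass_at_1(results):
    """Estimate pass@1 from graded samples.

    results: one list per problem, holding one boolean per sampled answer
    (True if the final answer matched the reference).
    """
    # Fraction of correct samples for each problem.
    per_problem = [sum(r) / len(r) for r in results]
    # Average over problems.
    return sum(per_problem) / len(per_problem)


# Hypothetical graded outcomes: 4 problems, 2 samples each.
toy_results = [[True, True], [True, False], [False, False], [True, True]]
score = pass_at_1(toy_results)
```

With one sample per problem this reduces to plain accuracy; averaging over more samples only lowers the variance of the estimate.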

Abstract

Reasoning abilities, especially those for solving complex math problems, are crucial components of general intelligence. Recent proprietary models, such as OpenAI's o-series, have made remarkable progress on reasoning tasks. However, their complete technical details remain undisclosed, and the only techniques believed with certainty to be adopted are reinforcement learning (RL) and long chains of thought. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks, where only binary outcome rewards are easily accessible. We theoretically prove that behavior cloning on positive trajectories from best-of-N (BoN) sampling is sufficient to learn the KL-regularized optimal policy in binary feedback…
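
The data-collection step behind "behavior cloning on positive trajectories from best-of-N (BoN) sampling" can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `toy_policy` is a hypothetical stand-in for a language model, the reward is an exact-match check against a reference answer, and the fine-tuning step itself is omitted.

```python
import random


def binary_outcome_reward(answer, reference):
    """Binary outcome reward: 1 if the final answer is correct, else 0."""
    return 1 if answer == reference else 0


def collect_positive_trajectories(sample_fn, prompt, reference, n, rng):
    """Sample n trajectories for one prompt and keep only the rewarded ones."""
    positives = []
    for _ in range(n):
        reasoning, answer = sample_fn(prompt, rng)
        if binary_outcome_reward(answer, reference) == 1:
            positives.append((prompt, reasoning, answer))
    return positives


def toy_policy(prompt, rng):
    # Hypothetical policy: picks one of two candidate answers at random.
    answer = rng.choice(["4", "5"])
    return f"reasoning that ends in {answer}", answer


rng = random.Random(0)
# The retained trajectories would serve as behavior-cloning targets.
bc_data = collect_positive_trajectories(toy_policy, "2+2", "4", n=16, rng=rng)
```

Only trajectories with reward 1 survive the filter, so a prompt whose N samples all fail contributes no cloning data under this scheme.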

Code & Models

Repositories

internlm/oreal (PyTorch, official)

Datasets

internlm/OREAL-RL-Prompts (dataset, 240 downloads)

Taxonomy

Topics: Mathematics Education and Pedagogy · Mathematics Education and Teaching Techniques