Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
Chengqi Lyu, Songyang Gao, Yuzhe Gu, Wenwei Zhang, Jianfei Gao, Kuikun Liu, Ziyi Wang, Shuaibin Li, Qian Zhao, Haian Huang, Weihan Cao, Jiangning Liu, Hongwei Liu, Junnan Liu, Songyang Zhang, Dahua Lin, Kai Chen

TL;DR
This paper introduces OREAL, a reinforcement learning framework using outcome rewards for mathematical reasoning, achieving state-of-the-art accuracy with significantly smaller models by leveraging binary feedback and token-level rewards.
Contribution
The paper proposes an RL approach to mathematical reasoning that learns from outcome rewards, with a theoretical analysis and practical techniques that improve the performance of smaller models.
Findings
A 7B model achieves 94.0% pass@1 on MATH-500
A 32B model surpasses previous distillation-based models with 95.0% pass@1 on MATH-500
Token-level reward sampling enhances learning in sparse reward environments
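The pass@1 figures above can be read as the fraction of problems answered correctly with a single sampled solution, scored by a binary outcome check. A minimal sketch of that computation follows, assuming exact string match against a reference answer; the function names and the toy data are hypothetical and not taken from the paper, whose evaluation setup may differ.

```python
def outcome_reward(predicted: str, reference: str) -> int:
    """Binary outcome reward: 1 if the final answer matches, else 0."""
    return int(predicted.strip() == reference.strip())


def pass_at_1(predictions: list[str], references: list[str]) -> float:
    """Fraction of problems whose single sampled answer is correct."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("need at least one problem")
    rewards = [outcome_reward(p, r) for p, r in zip(predictions, references)]
    return sum(rewards) / len(rewards)


# Toy data, purely illustrative (not results from the paper).
preds = ["42", "3/4", "7", "x=2"]
refs = ["42", "3/4", "8", "x=2"]
print(pass_at_1(preds, refs))  # 0.75
```

Real math benchmarks usually need answer normalization (for example, treating 0.75 and 3/4 as equal), which this sketch leaves out.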
Abstract
Reasoning abilities, especially those for solving complex math problems, are crucial components of general intelligence. Recent advances by proprietary companies, such as o-series models of OpenAI, have made remarkable progress on reasoning tasks. However, the complete technical details remain unrevealed, and the techniques that are believed certainly to be adopted are only reinforcement learning (RL) and the long chain of thoughts. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks, where only binary outcome rewards are easily accessible. We theoretically prove that behavior cloning on positive trajectories from best-of-N (BoN) sampling is sufficient to learn the KL-regularized optimal policy in binary feedback…
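The data-collection step behind "behavior cloning on positive trajectories from best-of-N sampling" can be sketched as follows: sample N candidates per problem, score each with a binary outcome reward, and keep only the correct ones as supervised targets. This is an illustration under stated assumptions, not the authors' implementation: `toy_policy` is a hypothetical stand-in for a language model, `verify` and `collect_positive_trajectories` are hypothetical names, and the fine-tuning step itself is omitted.

```python
import random


def verify(answer: int, target: int) -> int:
    """Binary outcome reward: 1 for a correct final answer, else 0."""
    return int(answer == target)


def toy_policy(a: int, b: int, rng: random.Random) -> tuple[str, int]:
    """Hypothetical stand-in for an LLM: adds a and b, sometimes off by one."""
    answer = a + b + rng.choice([-1, 0, 0, 1])
    return f"{a} + {b} = {answer}", answer


def collect_positive_trajectories(problems, n, seed=0):
    """Sample n candidates per problem; keep those with reward 1."""
    rng = random.Random(seed)
    dataset = []
    for a, b in problems:
        for _ in range(n):
            trajectory, answer = toy_policy(a, b, rng)
            if verify(answer, a + b) == 1:
                dataset.append(((a, b), trajectory))
    return dataset


# Positive trajectories would then serve as behavior-cloning targets.
positives = collect_positive_trajectories([(1, 2), (10, 5)], n=8)
for prompt, trajectory in positives:
    print(prompt, "->", trajectory)
```

Only the sampling and filtering are shown here. How the retained trajectories are weighted during training, and how incorrect samples are treated, is not covered by this sketch.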
Taxonomy
Topics: Mathematics Education and Pedagogy · Mathematics Education and Teaching Techniques
