Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning

Zhiheng Xi, Wenxiang Chen, Boyang Hong, Senjie Jin, Rui Zheng, Wei He, Yiwen Ding, Shichun Liu, Xin Guo, Junzhe Wang, Honglin Guo, Wei Shen, Xiaoran Fan, Yuhao Zhou, Shihan Dou, Xiao Wang, Xinbo Zhang, Peng Sun, Tao Gui, Qi Zhang, Xuanjing Huang

arXiv:2402.05808·cs.AI·March 19, 2024·2 cites

TL;DR

This paper introduces R$^3$, a novel reinforcement learning method that uses outcome supervision and reverse curriculum learning to improve reasoning in large language models, reducing manual annotation needs.

Contribution

R$^3$ is the first approach to combine outcome supervision with reverse curriculum RL for large language models, enabling step-wise reasoning without extensive manual annotations.

Findings

01

R$^3$ outperforms the RL baseline on eight reasoning tasks by 4.1 points on average.

02

On GSM8K, R$^3$ exceeds the baseline by 4.2 points.

03

Without extra data, R$^3$ with CodeLlama-7B performs comparably to larger or closed-source models.

Abstract

In this paper, we propose R$^3$: Learning Reasoning through Reverse Curriculum Reinforcement Learning (RL), a novel method that employs only outcome supervision to achieve the benefits of process supervision for large language models. The core challenge in applying RL to complex reasoning is to identify a sequence of actions that result in positive rewards and provide appropriate supervision for optimization. Outcome supervision provides sparse rewards for final results without identifying error locations, whereas process supervision offers step-wise rewards but requires extensive manual annotation. R$^3$ overcomes these limitations by learning from correct demonstrations. Specifically, R$^3$ progressively slides the start state of reasoning from a demonstration's end to its beginning, facilitating easier model exploration at all stages. Thus, R$^3$ establishes a step-wise curriculum,…
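
The sliding start state described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a demonstration has already been split into a list of step strings, and the names `reverse_curriculum_starts`, `outcome_reward`, and the toy example are hypothetical. It only shows how start states could be ordered from the end of a demonstration back to its beginning, with a sparse reward on the final answer.

```python
from typing import List, Tuple


def reverse_curriculum_starts(question: str, demo_steps: List[str]) -> List[Tuple[int, str]]:
    """Return (steps_left_to_generate, start_state) pairs, easiest stage first.

    Assumption: a start state is the question followed by a prefix of a
    correct demonstration. The first stage keeps all but the last step, so
    only one step remains to be generated; the last stage keeps no steps,
    which is the original task.
    """
    stages = []
    for kept in range(len(demo_steps) - 1, -1, -1):
        prefix = "\n".join(demo_steps[:kept])
        start_state = question if kept == 0 else f"{question}\n{prefix}"
        stages.append((len(demo_steps) - kept, start_state))
    return stages


def outcome_reward(predicted_answer: str, gold_answer: str) -> float:
    """Sparse outcome supervision: 1.0 only when the final answer matches."""
    return 1.0 if predicted_answer.strip() == gold_answer.strip() else 0.0


# Toy demonstration (hypothetical data, for illustration only).
question = "Tom has 3 apples and buys 2 more. How many apples does he have?"
demo = ["Tom starts with 3 apples.", "He buys 2 more: 3 + 2 = 5.", "The answer is 5."]
stages = reverse_curriculum_starts(question, demo)

for steps_left, start_state in stages:
    print(f"--- steps left to generate: {steps_left} ---")
    print(start_state)
```

In an RL loop, a policy would be rolled out from each start state and scored only with `outcome_reward`. Because the early stages start close to the answer, a positive reward is easier to reach there, which is the intuition behind the step-wise curriculum mentioned in the abstract.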

Code & Models

Repositories

woooodyy/llm-reverse-curriculum-rl
PyTorch · Official

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Online Learning and Analytics