Training Language Models to Self-Correct via Reinforcement Learning

Aviral Kumar; Vincent Zhuang; Rishabh Agarwal; Yi Su; John D Co-Reyes; Avi Singh; Kate Baumli; Shariq Iqbal; Colton Bishop; Rebecca Roelofs; Lei M Zhang; Kay McKinney; Disha Shrivastava; Cosmin Paduraru; George Tucker; Doina Precup; Feryal Behbahani; Aleksandra Faust

arXiv:2409.12917·cs.LG·October 7, 2024·6 cites

TL;DR

This paper introduces SCoRe, a reinforcement learning method enabling large language models to self-correct more effectively using only self-generated data, surpassing previous approaches in performance.

Contribution

The paper presents a novel multi-turn RL approach, SCoRe, that improves LLM self-correction without relying on multiple models or extra supervision, addressing distribution mismatch and behavior collapse issues.

Findings

01 SCoRe improves self-correction by 15.6% on MATH.

02 SCoRe improves self-correction by 9.1% on HumanEval.

03 SCoRe achieves state-of-the-art self-correction performance.
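To make the self-correction numbers above concrete, here is a minimal sketch of how a first-to-second-attempt gain could be computed. It assumes that "self-correction" is measured as the change in accuracy between a model's first and second attempts, in percentage points. That reading is an assumption, since this page does not define the metric. The names `TwoAttemptResult` and `self_correction_metrics` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class TwoAttemptResult:
    """Correctness of one problem's two attempts (hypothetical record type)."""
    first_correct: bool
    second_correct: bool


def self_correction_metrics(results):
    """Return per-attempt accuracy and the change between attempts, in percentage points.

    Assumption: self-correction is summarized by acc_t2 - acc_t1.
    """
    n = len(results)
    if n == 0:
        raise ValueError("need at least one result")
    acc_t1 = 100.0 * sum(r.first_correct for r in results) / n
    acc_t2 = 100.0 * sum(r.second_correct for r in results) / n
    # Problems the second attempt fixed, and problems it broke.
    fixed = 100.0 * sum((not r.first_correct) and r.second_correct for r in results) / n
    broken = 100.0 * sum(r.first_correct and (not r.second_correct) for r in results) / n
    return {
        "acc_t1": acc_t1,
        "acc_t2": acc_t2,
        "delta": acc_t2 - acc_t1,
        "incorrect_to_correct": fixed,
        "correct_to_incorrect": broken,
    }
```

Splitting the change into fixed and broken problems is a design choice of this sketch: a near-zero delta can hide a model that repairs some answers while damaging as many others.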

Abstract

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse,…

Peer Reviews

Decision·ICLR 2025 Oral

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. First approach for making self-correction really work.
2. Very solid experiments and ablation studies, along with in-depth analysis providing insights for achieving inference-time scaling like OpenAI's o1 series.

Weaknesses

1. This work conducts experiments on the private Gemini series, which is hard to reproduce; it would be beneficial to include experiments on open-source models (Llama 3).
2. This work explores only 3 datasets (HumanEval, MBPP and MATH), covering code and math. It would be better to introduce more datasets of varying difficulty levels (e.g. AIME).
3. It would also be better to conduct experiments on a broader range of subjects (e.g. Physics, Chemistry).

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1. The work identifies and studies two limitations of existing self-correction methods: distribution shift and behavior collapse.
2. The work proposes a novel and original multi-turn RL method. The method's significance lies in its potential to address key limitations of existing approaches.
3. The quality of the empirical analysis is good, showing improvements in self-correction metrics on established datasets and ablation studies on various components of the proposed method.
4. The work is

Weaknesses

1. No experiments are conducted with open-source models such as the Llama series.
2. The models are trained for only two attempts, leaving the scalability of the proposed method to additional attempts uncertain.

Reviewer 03 · Rating 8 · Confidence 2

Strengths

1. The paper's analysis of the failure modes of prior SFT-based methods is very insightful, with the authors making use of an edit-distance-based metric and an analysis of train-test differences to understand why prior methods fail to learn self-correction or fail to generalize out-of-distribution.
2. The results appear relatively strong, with SCoRe substantially outperforming prior methods on the evaluations presented.
3. The presentation is overall relatively clear.
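The edit-distance-based analysis mentioned above can be illustrated with a small sketch. This is not the paper's metric, whose exact definition does not appear on this page. It swaps in the similarity ratio from Python's standard `difflib.SequenceMatcher` as a stand-in for a normalized edit distance. The function names and the `threshold` value are hypothetical. The idea is that if most second attempts are near-copies of the first, the model may have collapsed into making no real correction.

```python
from difflib import SequenceMatcher


def edit_ratio(first_attempt: str, second_attempt: str) -> float:
    """Dissimilarity in [0, 1]: 0.0 for identical strings, 1.0 for no shared content.

    Uses difflib's similarity ratio as a stand-in for a normalized edit distance.
    """
    matcher = SequenceMatcher(None, first_attempt, second_attempt, autojunk=False)
    return 1.0 - matcher.ratio()


def fraction_unchanged(pairs, threshold=0.05):
    """Fraction of (first, second) pairs whose second attempt is nearly a copy of the first.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    unchanged = sum(edit_ratio(a, b) <= threshold for a, b in pairs)
    return unchanged / len(pairs)
```

Comparing this fraction between training-set and held-out prompts is one way to look for the train-test differences the reviewer describes, under the same assumptions.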

Weaknesses

1. Certain choices in the technique don't appear to be "as simple as possible," and the text doesn't consistently do a good job of motivating these choices. (See questions.)
2. I would like to see these results compared to the very simple baseline of RL directly against the final answer, but with the self-correction prompt inserted after the first turn.

Code & Models

Repositories

Models

🤗 codelion/scorelora · model · 2 dl · ♡ 3

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Natural Language Processing Techniques

Methods: Shrink and Fine-Tune · Balanced Selection