Let's Verify Step by Step

Hunter Lightman; Vineet Kosaraju; Yura Burda; Harri Edwards; Bowen Baker; Teddy Lee; Jan Leike; John Schulman; Ilya Sutskever; Karl Cobbe

arXiv:2305.20050 · cs.LG · June 1, 2023 · 30 cites

TL;DR

This paper compares outcome and process supervision for training large language models on multi-step reasoning. It finds process supervision more effective, shows that active learning improves its effectiveness, and releases a large dataset of human feedback.

Contribution

The paper provides a comprehensive comparison of outcome and process supervision, shows that process supervision is superior for training models on the MATH dataset, and introduces a large human feedback dataset (PRM800K).

Findings

1. Process supervision outperforms outcome supervision on the MATH dataset.
2. Active learning enhances the effectiveness of process supervision.
3. The PRM800K dataset of human feedback labels is released.
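
One way to picture the active-learning finding is as a selection step that decides which model-generated solutions are sent to human labelers. The sketch below is a hypothetical illustration only: this page does not describe the selection rule, so the rule used here (prefer solutions that the current reward model scores highly even though their final answer is wrong) and the names `Sample` and `select_for_labeling` are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    solution_id: str
    prm_score: float            # hypothetical score from the current reward model
    final_answer_correct: bool  # checked automatically against the known answer


def select_for_labeling(samples: List[Sample], budget: int) -> List[Sample]:
    """Pick up to `budget` solutions for step-level human labeling.

    Assumed rule: keep only solutions with a wrong final answer, then take
    the ones the current reward model rates highest. These are the cases
    where the reward model is most clearly fooled.
    """
    wrong = [s for s in samples if not s.final_answer_correct]
    wrong.sort(key=lambda s: s.prm_score, reverse=True)
    return wrong[:budget]


# Illustrative pool; all values are made up.
pool = [
    Sample("a", 0.90, False),
    Sample("b", 0.95, True),
    Sample("c", 0.30, False),
    Sample("d", 0.70, False),
]
chosen = select_for_labeling(pool, budget=2)
chosen_ids = [s.solution_id for s in chosen]
```

Under this assumed rule, labeling effort goes to solutions that look convincing but are wrong, rather than to a uniform sample of all solutions.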

Abstract

In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset…
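
The difference between the two kinds of feedback can be illustrated with a small sketch. It assumes, hypothetically, that a reward model has already produced either one probability for the final result (outcome supervision) or one correctness probability per reasoning step (process supervision). The names `Candidate`, `outcome_score`, `process_score`, and `first_suspect_step`, the choice to combine step probabilities by taking their product, and all numbers are illustrative assumptions, not details taken from this page.

```python
from dataclasses import dataclass
from math import prod
from typing import List, Optional


@dataclass
class Candidate:
    answer: str
    step_probs: List[float]  # hypothetical per-step correctness probabilities
    outcome_prob: float      # hypothetical single score for the final result


def outcome_score(c: Candidate) -> float:
    # Outcome-style feedback: one number for the whole solution.
    return c.outcome_prob


def process_score(c: Candidate) -> float:
    # Process-style feedback: combine per-step scores. The product is an
    # assumed aggregation rule (probability that every step is correct).
    return prod(c.step_probs)


def first_suspect_step(c: Candidate, threshold: float = 0.5) -> Optional[int]:
    # Step-level scores also say *where* a solution probably went wrong.
    for i, p in enumerate(c.step_probs):
        if p < threshold:
            return i
    return None


# Two made-up candidate solutions to the same problem.
candidates = [
    Candidate("12", [0.9, 0.2, 0.95], 0.8),
    Candidate("15", [0.9, 0.85, 0.9], 0.7),
]

best_by_outcome = max(candidates, key=outcome_score)
best_by_process = max(candidates, key=process_score)
```

In this toy data the two scoring rules select different candidates, and only the step-level scores point at the specific step (index 1 of the first candidate) that looks unreliable.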

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

1. Assigning rewards for intermediate steps is a novel and intuitive idea. Compared to assigning rewards for the outcome alone, judging intermediate steps better leverages problem structure.
2. The empirical results are very strong.
3. The released dataset can help the research community explore this direction further.

Weaknesses

1. The reproducibility of this work is concerning. It is hard to understand under what conditions one can successfully train an effective process reward model. The authors did not provide sufficient details about either the models or the data. In the paper, the authors stated, "The small-scale base models are similar in design to GPT-4, but they were pretrained with roughly 200 times less compute." However, the paper reveals the size of neither the small-scale nor the large-scale base models.

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

1. A new dataset with humans verifying each reasoning step.
2. Process supervision is important, and the labeled dataset is useful for further research on mathematical reasoning.
3. The authors have also run experiments with active learning to improve efficacy.

Weaknesses

1. The work only explores math problems. It would be better to explore different tasks.
2. The authors haven't applied it to RLHF. It is still not clear how process supervision and outcome supervision affect the performance of the generation model. In fact, if outcome supervision is large and diverse enough, a model trained with outcome supervision can also perform process supervision by being fed the reasoning path step by step.

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 3

Strengths

1. The idea of providing fine-grained feedback for learning models is intuitive and technically sound.
2. They collect a dataset PRM800K that contains step-level labels across different solutions to math problems, which is helpful for future research in this direction.
3. The paper is well-written and easy to follow.

Weaknesses

1. It is a well-known problem in the RL community that providing dense rewards can be better than providing sparse rewards, and there is a research direction on reward shaping that is very relevant to this idea. Therefore, the idea of providing intermediate rewards for training can lack novelty.
2. The paper mainly conducts experiments in a math dataset, where it is easy to provide step-by-step intermediate rewards. However, there are many other tasks such as essay writing and story generation w

Videos

How OpenAI made o1 "think" – Here is what we think and already know about o1 reinforcement learning· youtube

GPT-5: Everything You Need to Know So Far· youtube

Q* - Clues to the Puzzle?· youtube

Taxonomy

Topics: Topic Modeling · Software Engineering Research · Natural Language Processing Techniques

Methods: Test