Training Verifiers to Solve Math Word Problems

Karl Cobbe; Vineet Kosaraju; Mohammad Bavarian; Mark Chen; Heewoo Jun; Lukasz Kaiser; Matthias Plappert; Jerry Tworek; Jacob Hilton; Reiichiro Nakano; Christopher Hesse; John Schulman

arXiv:2110.14168·cs.LG·November 19, 2021·23 cites

TL;DR

This paper introduces GSM8K, a dataset of 8.5K linguistically diverse grade school math word problems, and proposes training verifiers that judge the correctness of model-generated solutions. Verification significantly improves accuracy on GSM8K and scales more effectively with additional data than a finetuning baseline.

Contribution

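The paper contributes a new dataset, GSM8K, and a verification-based approach to math problem solving with language models: at test time, many candidate solutions are generated and the one ranked highest by the verifier is selected. It also provides empirical evidence that verification scales more effectively with increased data than a finetuning baseline.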

Findings

01 Verification significantly improves accuracy on GSM8K.

02 Verification scales more effectively with increased data than a finetuning baseline.

03 Even the largest models still struggle with multi-step math reasoning.

Abstract

State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
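The test-time procedure described in the abstract (generate many candidates, keep the one the verifier ranks highest) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `select_best_solution`, `make_toy_generator`, and `toy_score` are hypothetical names, the generator replays canned strings instead of sampling from a language model, and the scoring function is a hand-written arithmetic check standing in for a trained verifier. The default of 8 candidates is an arbitrary choice for the example.

```python
import itertools
from typing import Callable, Iterable, Tuple


def select_best_solution(
    problem: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    num_candidates: int = 8,  # arbitrary default for this sketch
) -> Tuple[str, float]:
    """Sample candidate solutions and return the highest-scoring one."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates = [generate(problem) for _ in range(num_candidates)]
    scored = [(score(problem, cand), cand) for cand in candidates]
    # max() keeps the first candidate among ties on score.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def make_toy_generator(canned: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a generator model: cycles through fixed strings."""
    stream = itertools.cycle(list(canned))

    def generate(problem: str) -> str:
        return next(stream)

    return generate


def toy_score(problem: str, solution: str) -> float:
    """Hypothetical stand-in for a verifier.

    Expects solutions of the form "a * b - c = d" and returns 1.0 when the
    arithmetic is consistent, else 0.0. A trained verifier would instead
    output a learned estimate of correctness.
    """
    lhs, rhs = solution.split("=")
    a, b, c = (int(tok) for tok in lhs.replace("*", " ").replace("-", " ").split())
    return 1.0 if a * b - c == int(rhs) else 0.0


if __name__ == "__main__":
    problem = "Tom has 3 boxes of 4 apples and eats 2. How many apples are left?"
    generate = make_toy_generator(
        ["3 * 4 - 2 = 14", "3 * 4 - 2 = 10", "3 * 4 - 2 = 9"]
    )
    best, best_score = select_best_solution(problem, generate, toy_score, 3)
    print(best, best_score)
```

Passing the generator and the scorer in as callables keeps the selection step independent of any particular model, so either stand-in can be swapped for a real component without changing `select_best_solution`.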

Code & Models

Videos

[ML News] Google introduces Pathways | OpenAI solves Math Problems | Meta goes First Person · YouTube

o1 - What is Going On? Why o1 is a 3rd Paradigm of Model + 10 Things You Might Not Know · YouTube

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Test