Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun,, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro, Nakano, Christopher Hesse, John Schulman

TL;DR
This paper introduces GSM8K, a diverse dataset of math word problems, and proposes training verifiers to improve model accuracy in solving these problems, demonstrating that verification enhances performance more effectively than finetuning.
Contribution
The paper presents a new dataset GSM8K and a verification-based approach to improve math problem-solving accuracy in language models, showing verification's scalability over finetuning.
Findings
Verification improves accuracy on GSM8K.
Verification scales better with more data than finetuning.
Large models still struggle with multi-step math reasoning.
Abstract
State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
- 🤗google/gemma-3-4b-itmodel· 1.5M dl· ♡ 12721.5M dl♡ 1272
- 🤗google/gemma-3-27b-itmodel· 1.0M dl· ♡ 19401.0M dl♡ 1940
- 🤗unsloth/gemma-3-12b-it-GGUFmodel· 101k dl· ♡ 178101k dl♡ 178
- 🤗google/gemma-3-1b-itmodel· 1.4M dl· ♡ 8991.4M dl♡ 899
- 🤗google/gemma-3-12b-it-qat-q4_0-ggufmodel· 7.1k dl· ♡ 2627.1k dl♡ 262
- 🤗google/gemma-3-270mmodel· 83k dl· ♡ 100383k dl♡ 1003
- 🤗google/gemma-7bmodel· 30k dl· ♡ 329330k dl♡ 3293
- 🤗google/gemma-2-2b-itmodel· 368k dl· ♡ 1314368k dl♡ 1314
- 🤗google/gemma-3-12b-itmodel· 2.6M dl· ♡ 6982.6M dl♡ 698
- 🤗google/gemma-3-12b-it-qat-q4_0-unquantizedmodel· 28k dl· ♡ 8128k dl♡ 81
Videos
[ML News] Google introduces Pathways | OpenAI solves Math Problems | Meta goes First Person· youtube
o1 - What is Going On? Why o1 is a 3rd Paradigm of Model + 10 Things You Might Not Know· youtube
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
MethodsTest
