Solving math word problems with process- and outcome-based feedback
Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, Irina Higgins

TL;DR
This paper compares process- and outcome-based supervision for training language models on math word problems (GSM8K). Outcome-based supervision reaches similar final-answer error with fewer labels, while process-based feedback, or a learned reward model, is needed to reduce reasoning errors.
Contribution
It provides the first comprehensive comparison between process- and outcome-based supervision on GSM8K, demonstrating the benefits of process-based feedback for reasoning accuracy.
Findings
Outcome supervision achieves similar final-answer accuracy with less labeling.
Process supervision or learned reward models are needed for correct reasoning steps.
The best results reduce final-answer error from 16.8% to 12.7% and reasoning error among final-answer-correct solutions from 14.0% to 3.4%.
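The two error rates reported above can be read as simple ratios. The following is a minimal sketch, not the authors' evaluation code: it assumes each sampled solution carries a final-answer check plus per-step correctness judgments, and that reasoning error is measured only over solutions whose final answer is correct. The `Solution` class, the function names, and the toy data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Solution:
    final_answer_correct: bool   # final answer matches the reference
    steps_correct: List[bool]    # hypothetical per-step correctness judgments


def final_answer_error(solutions: List[Solution]) -> float:
    """Fraction of solutions whose final answer is wrong."""
    wrong = sum(not s.final_answer_correct for s in solutions)
    return wrong / len(solutions)


def reasoning_error_among_correct(solutions: List[Solution]) -> float:
    """Among final-answer-correct solutions, the fraction with a flawed step."""
    correct = [s for s in solutions if s.final_answer_correct]
    if not correct:
        return 0.0
    flawed = sum(not all(s.steps_correct) for s in correct)
    return flawed / len(correct)


# Toy data for illustration only; these are not results from the paper.
toy = [
    Solution(True, [True, True, True]),
    Solution(True, [True, False, True]),   # right answer, flawed reasoning
    Solution(False, [True, False]),
    Solution(True, [True, True]),
]

print(final_answer_error(toy))
print(reasoning_error_among_correct(toy))
```

On the toy data, one of four solutions has a wrong final answer, and one of the three remaining solutions reaches the right answer through a flawed step. The second case is the kind of error a final-answer check alone cannot see.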
Abstract
Recent work has shown that asking language models to generate reasoning steps improves performance on many reasoning tasks. When moving beyond prompting, this raises the question of how we should supervise such models: outcome-based approaches which supervise the final result, or process-based approaches which supervise the reasoning process itself? Differences between these approaches might naturally be expected not just in final-answer errors but also in reasoning errors, which can be difficult to detect and are problematic in many real-world domains such as education. We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task, GSM8K. We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. However, for correct reasoning steps we find it necessary to use…
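To make the distinction in the abstract concrete, here is a small sketch of how the two kinds of supervision could label the same solution. It is an illustration under stated assumptions rather than the paper's pipeline: the function names, the example problem, and the per-step judgments are hypothetical, and the judgments stand in for feedback from a human annotator.

```python
from typing import List


def outcome_labels(steps: List[str], predicted: str, reference: str) -> List[bool]:
    """Outcome-based: a single signal from the final answer, shared by every step."""
    ok = predicted.strip() == reference.strip()
    return [ok] * len(steps)


def process_labels(step_judgments: List[bool]) -> List[bool]:
    """Process-based: each step keeps its own correctness judgment."""
    return list(step_judgments)


# Hypothetical problem: 3 boxes of 4 apples, half are eaten, how many remain?
steps = [
    "3 boxes * 4 apples = 12 apples",
    "Half of 12 is 5",            # arithmetic slip
    "12 - 5 = 6 apples remain",   # second slip that cancels the first
]
judgments = [True, False, False]  # assumed per-step annotations

print(outcome_labels(steps, predicted="6", reference="6"))
print(process_labels(judgments))
```

In this example the final answer is correct, so outcome-based labels mark every step as good, while process-based labels flag the two faulty steps. This is the kind of reasoning error that the abstract describes as difficult to detect.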
Code & Models
- 🤗 qgallouedec/Qwen2-0.5B-Reward (model · 1 dl)
- 🤗 qgallouedec/Qwen2-0.5B-Reward-Math-Sheperd (model · 1 dl · ♡ 1)
- 🤗 plaguss/Qwen2.5-0.5B-Math-Shepherd-PRM-0.1 (model · 2 dl)
- 🤗 plaguss/Mistral-7B-v0.1-Math-Shepherd-PRM-0.1 (model · 2 dl)
- 🤗 plaguss/Mistral-7B-v0.1-Math-Shepherd-PRM-token-0.1 (model · 2 dl)
- 🤗 plaguss/Qwen2.5-0.5B-Math-Shepherd-PRM-token-0.1 (model · 3 dl)
- 🤗 qgallouedec/Qwen2-0.5B-Reward-Math-Sheperd-KN-fix-cast (model · 2 dl)
- 🤗 trl-lib/Qwen2-0.5B-Reward-Math-Sheperd (model · 6 dl · ♡ 1)
- 🤗 plaguss/Qwen2.5-0.5B-Math-Shepherd-PRM-0.2 (model · 6 dl)
- 🤗 plaguss/Mistral-7B-v0.1-Math-Shepherd-PRM-0.2 (model)
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Intelligent Tutoring Systems and Adaptive Learning
