Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards
Jan Niklas Groeneveld, Xi Qin, Alexander Schaefer, Yaad Oren

TL;DR
This paper demonstrates that small language models can be effectively transformed into reward models for code generation, improving search and evaluation capabilities without requiring large models.
Contribution
It introduces a method to turn small LLMs into reward models for code, combining process and outcome rewards, and shows their effectiveness in code evaluation tasks.
Findings
Small LLMs can serve as effective reward models.
Using these reward models improves code search by over 20%.
Small models can evaluate code correctness accurately.
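One common way a reward model supports code search is best-of-N reranking: generate several candidate solutions, score each one, and keep the highest-scoring candidate. The sketch below illustrates that idea only. Whether it matches the paper's search procedure is an assumption, and `best_of_n` and `toy_score` are hypothetical names, with `toy_score` standing in for a trained reward model.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate that the scoring function rates highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def toy_score(code: str) -> float:
    # Hypothetical stand-in for a reward model: a real model would return
    # a learned estimate of the probability that the code is correct.
    return 1.0 if "return a + b" in code else 0.1


candidates = [
    "def add(a, b):\n    return a - b",
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    pass",
]

best = best_of_n(candidates, toy_score)
```

Here `best` is the second candidate, the only one the toy scorer rates highly. With a real reward model, only the `score` callable would change.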
Abstract
Generating high-quality code remains a challenge for Large Language Models (LLMs). For the evolution of reasoning models on this task, reward models are a necessary intermediate step. These models judge outcomes or intermediate steps. Decoder-only transformer models can be turned into reward models by introducing a regression layer and supervised fine-tuning. While it is known that reflection capabilities generally increase with the size of a model, we want to investigate whether state-of-the-art small language models like the Phi-4 family can be turned into usable reward models that blend process rewards and outcome rewards. To this end, we construct a dataset of code samples with correctness labels derived from the APPS coding challenge benchmark. We then train a value-head model to estimate the success probability of intermediate outputs. Our evaluation…
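The regression layer mentioned in the abstract can be pictured as a value head: a single linear layer with a sigmoid that maps a hidden-state vector to a success probability. The following is a minimal pure-Python sketch under several assumptions that the abstract does not state: the hidden states are toy two-dimensional vectors rather than transformer activations, training uses binary cross-entropy with plain SGD, and each intermediate output carries the correctness label of its final solution. The class name `ValueHead` and its methods are hypothetical.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


class ValueHead:
    """Linear layer plus sigmoid: hidden-state vector -> success probability."""

    def __init__(self, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w = [rng.uniform(-0.01, 0.01) for _ in range(dim)]
        self.b = 0.0

    def predict(self, h: list) -> float:
        return sigmoid(sum(wi * hi for wi, hi in zip(self.w, h)) + self.b)

    def sgd_step(self, h: list, label: int, lr: float = 0.5) -> None:
        # For binary cross-entropy, the gradient w.r.t. the logit is (p - label).
        err = self.predict(h) - label
        self.w = [wi - lr * err * hi for wi, hi in zip(self.w, h)]
        self.b -= lr * err


# Toy "hidden states" (assumed, not real model activations) paired with
# correctness labels: 1 = the sample eventually passes, 0 = it fails.
data = [
    ([1.0, 0.0], 1),
    ([0.9, 0.1], 1),
    ([0.0, 1.0], 0),
    ([0.1, 0.9], 0),
]

head = ValueHead(dim=2)
for _ in range(200):
    for h, y in data:
        head.sgd_step(h, y)

p_good = head.predict([1.0, 0.0])
p_bad = head.predict([0.0, 1.0])
```

After training, `p_good` is close to 1 and `p_bad` is close to 0. In a real setup the input vector would come from the language model's final hidden layer, and the head would be trained jointly with or on top of the fine-tuned model.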
