Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning
Arash Ahmadi, Sarah Sharif, Yaser (Mike) Banad

TL;DR
This paper presents a search-driven framework that treats the reward function used in reinforcement learning post-training as the object of optimization, and demonstrates it on GSM8K mathematical reasoning with a fixed Llama-3.2-3B-Instruct base model.
Contribution
It introduces a method in which a frontier language model generates candidate reward functions, which are then validated automatically, screened through short GRPO training runs, and ranked by F1 on the GSM8K test set.
Findings
Mean F1 improved from 0.596 to 0.632 over five rounds.
The top-ranked reward function achieved F1 = 0.787; the ensemble achieved F1 = 0.795.
Search-driven reward optimization outperforms baseline rewards.
Abstract
Mathematical reasoning is a key benchmark for large language models. Reinforcement learning is a standard post-training mechanism for improving the reasoning capabilities of large language models, yet performance remains sensitive to the design of the reward function that drives policy optimization. This paper introduces a search-driven framework that treats the reward specification itself as an object of optimization. The setting of interest is one in which the base model is held fixed and the reward specification is the primary remaining design lever. Candidate reward functions are generated by a frontier language model, validated automatically, screened through 500-step Group Relative Policy Optimization (GRPO) training runs on a Llama-3.2-3B-Instruct base model with Low-Rank Adaptation (LoRA), and ranked by F1 on the GSM8K test set. Ranked summaries from prior rounds are then fed…
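The generate, validate, screen, rank, and feed-back loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `exact_match_reward`, `format_bonus_reward`, `validate`, `screen`, `propose_stub`, and `search` are all hypothetical, the proposer stands in for the frontier language model, and `screen` returns a seeded placeholder score instead of running GRPO training and computing F1 on GSM8K.

```python
import random
import re
from typing import Callable, List, Tuple

# A reward function maps (model completion, gold answer) to a scalar reward.
RewardFn = Callable[[str, str], float]
Candidate = Tuple[str, RewardFn]


def exact_match_reward(completion: str, gold: str) -> float:
    """Hypothetical candidate: 1.0 if the last number in the completion equals gold."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return 1.0 if numbers and float(numbers[-1]) == float(gold) else 0.0


def format_bonus_reward(completion: str, gold: str) -> float:
    """Hypothetical candidate: exact match plus a small bonus for a '####' marker."""
    bonus = 0.1 if "####" in completion else 0.0
    return exact_match_reward(completion, gold) + bonus


def validate(reward_fn: RewardFn) -> bool:
    """Automatic validation (assumed form): the function must return a float
    on a few probe inputs without raising."""
    probes = [("The answer is 42", "42"), ("", "7"), ("no digits here", "3")]
    try:
        return all(isinstance(reward_fn(c, g), float) for c, g in probes)
    except Exception:
        return False


def screen(reward_fn: RewardFn, seed: int) -> float:
    """PLACEHOLDER for a short training run followed by evaluation.
    Returns a seeded pseudo-random score; it is not a real F1 value."""
    return random.Random(seed).random()


def propose_stub(ranked_summary: List[Tuple[str, float]], k: int) -> List[Candidate]:
    """Stand-in for the language-model proposer. A real proposer would condition
    on ranked_summary; this stub ignores it and returns a fixed pool."""
    pool: List[Candidate] = [
        ("exact_match", exact_match_reward),
        ("format_bonus", format_bonus_reward),
        ("broken", lambda c, g: 1 / 0),  # deliberately invalid candidate
    ]
    return pool[:k]


def search(propose, rounds: int, per_round: int) -> List[Tuple[str, float]]:
    """Run the search loop and return (name, score) pairs, best first."""
    history: List[Tuple[str, float]] = []
    for _ in range(rounds):
        # Ranked results from prior rounds are fed back to the proposer.
        summary = sorted(history, key=lambda item: item[1], reverse=True)
        for name, fn in propose(summary, per_round):
            if not validate(fn):
                continue  # drop candidates that fail automatic validation
            history.append((name, screen(fn, seed=len(history))))
    return sorted(history, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, score in search(propose_stub, rounds=2, per_round=3):
        print(f"{name}: placeholder score {score:.3f}")
```

In a real pipeline, `screen` would be the expensive step, which is presumably why the abstract describes short 500-step runs with LoRA for screening rather than full training of every candidate.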
