Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning

Arash Ahmadi; Sarah Sharif; Yaser (Mike) Banad

arXiv:2605.02073 · cs.CL · May 11, 2026

TL;DR

This paper presents a search-driven reinforcement learning framework that treats the reward function itself as the object of optimization, with the goal of improving mathematical reasoning in large language models. On GSM8K, mean F1 rose from 0.596 to 0.632 over five rounds.

Contribution

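It introduces a method for automatically generating, validating, and ranking candidate reward functions: a frontier language model proposes candidates, each candidate is screened through a 500-step GRPO training run, and the candidates are ranked by F1 on the GSM8K test set.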

Findings

01. Mean F1 improved from 0.596 to 0.632 over five rounds.

02. The top-ranked reward achieved F1 = 0.787; the ensemble achieved F1 = 0.795.

03. Search-driven reward optimization outperformed the baseline rewards.

Abstract

Mathematical reasoning is a key benchmark for large language models. Reinforcement learning is a standard post-training mechanism for improving the reasoning capabilities of large language models, yet performance remains sensitive to the design of the reward function that drives policy optimization. This paper introduces a search-driven framework that treats the reward specification itself as an object of optimization. The setting of interest is one in which the base model is held fixed and the reward specification is the primary remaining design lever. Candidate reward functions are generated by a frontier language model, validated automatically, screened through 500-step Group Relative Policy Optimization (GRPO) training runs on a Llama-3.2-3B-Instruct base model with Low-Rank Adaptation (LoRA), and ranked by F1 on the GSM8K test set. Ranked summaries from prior rounds are then fed…
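
The loop described in the abstract (propose, validate, screen, rank, feed the ranking back) can be sketched as follows. This is a minimal toy under stated assumptions, not the authors' implementation: `Candidate`, `propose`, `is_valid`, `toy_screen`, and `search` are hypothetical names; the proposer is a random perturbation standing in for the frontier language model; and `toy_screen` is an arbitrary placeholder for the 500-step GRPO run followed by F1 evaluation on GSM8K. The toy reward assumes the GSM8K convention of marking the final answer with `####`.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """Hypothetical reward candidate: a weighted mix of correctness and format."""
    name: str
    weight: float   # share of the reward given to answer correctness
    f1: float = 0.0

    def reward(self, completion: str, reference: str) -> float:
        has_marker = "####" in completion
        final = completion.split("####")[-1].strip() if has_marker else ""
        correct = final == reference.strip()
        return self.weight * float(correct) + (1 - self.weight) * float(has_marker)


def propose(prior_ranked: List[Candidate], n: int, rng: random.Random,
            round_idx: int) -> List[Candidate]:
    """Stand-in for the frontier LM: perturb the best prior candidate, if any."""
    if prior_ranked:
        center = prior_ranked[0].weight
        weights = [min(1.0, max(0.0, center + rng.uniform(-0.2, 0.2)))
                   for _ in range(n)]
    else:
        weights = [rng.random() for _ in range(n)]
    return [Candidate(f"r{round_idx}_c{i}", w) for i, w in enumerate(weights)]


def is_valid(c: Candidate) -> bool:
    """Automatic validation: finite float outputs, and a correct answer never scores lower."""
    try:
        good = c.reward("work #### 42", "42")
        bad = c.reward("no answer", "42")
    except Exception:
        return False
    finite = all(isinstance(v, float) and math.isfinite(v) for v in (good, bad))
    return finite and good >= bad


def toy_screen(c: Candidate) -> float:
    """Placeholder score only. The paper screens with GRPO training plus GSM8K F1."""
    return 1.0 - abs(c.weight - 0.8)


def search(rounds: int, per_round: int,
           screen: Callable[[Candidate], float], seed: int = 0) -> List[Candidate]:
    rng = random.Random(seed)
    prior_ranked: List[Candidate] = []
    everything: List[Candidate] = []
    for r in range(rounds):
        cands = [c for c in propose(prior_ranked, per_round, rng, r) if is_valid(c)]
        for c in cands:
            c.f1 = screen(c)
        # Rank this round; the ranking is what the next round's proposer sees.
        prior_ranked = sorted(cands, key=lambda c: c.f1, reverse=True)
        everything.extend(prior_ranked)
    return sorted(everything, key=lambda c: c.f1, reverse=True)


if __name__ == "__main__":
    for cand in search(rounds=3, per_round=4, screen=toy_screen)[:3]:
        print(cand.name, round(cand.weight, 3), round(cand.f1, 3))
```

Passing the screening step in as a callable keeps the expensive part (training and evaluation) separate from the search logic, so the toy scorer could be swapped for a real training run without changing the loop.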
