Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS

Can Jin; Yang Zhou; Qixin Zhang; Hongwu Peng; Di Zhang; Zihan Dong; Marco Pavone; Ligong Han; Zhang-Wei Hong; Tong Che; Dimitris N. Metaxas

arXiv:2508.14313 · cs.LG · February 10, 2026


TL;DR

This paper introduces AIRL-S, a unified approach that leverages the reward function learned during RL training as an effective process reward model for guiding search in large language models, improving reasoning performance.

Contribution

AIRL-S unifies RL-based and search-based test-time scaling by deriving a process reward model directly from RL training. This removes the need for the human- or LLM-generated labels that separately trained PRMs require, and it improves performance on reasoning tasks.
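To illustrate how a process reward model can guide search, here is a minimal sketch of step-level best-of-N selection. It is not the paper's implementation: `toy_step_reward` is a hypothetical stand-in for a learned reward function, and the names `score_trajectory` and `best_of_n`, as well as the choice of `min` as the aggregation rule, are assumptions made for this example.

```python
from typing import Callable, List, Sequence

# A reasoning trajectory is modeled as a list of textual steps.
Trajectory = List[str]
# Hypothetical interface: score one step given the steps that precede it.
StepReward = Callable[[Sequence[str], str], float]


def score_trajectory(
    steps: Sequence[str],
    step_reward: StepReward,
    aggregate: Callable[[List[float]], float] = min,
) -> float:
    """Score every step against its prefix, then aggregate the step scores."""
    scores = [step_reward(steps[:i], steps[i]) for i in range(len(steps))]
    # An empty trajectory gets the lowest possible score.
    return aggregate(scores) if scores else float("-inf")


def best_of_n(
    candidates: Sequence[Trajectory],
    step_reward: StepReward,
    aggregate: Callable[[List[float]], float] = min,
) -> Trajectory:
    """Return the candidate trajectory with the highest aggregated score."""
    return max(
        candidates,
        key=lambda c: score_trajectory(c, step_reward, aggregate),
    )


def toy_step_reward(prefix: Sequence[str], step: str) -> float:
    # Assumption for illustration only: a real PRM would be a learned model.
    # This toy version rewards steps that contain an explicit equation.
    return 1.0 if "=" in step else 0.0


candidates = [
    ["think about it", "answer is 4"],
    ["2 + 2 = 4", "answer = 4"],
]
best = best_of_n(candidates, toy_step_reward)
print(best)
```

Aggregating with `min` penalizes a trajectory for its weakest step; passing a mean or a last-step scorer as `aggregate` gives other common variants. The same scoring function could equally rank partial trajectories inside a beam search.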

Findings

01  Improves performance by 9% on average across eight benchmarks.

02  Outperforms baseline PRMs trained with labeled data.

03  Enhances robustness and generalization in reasoning tasks.

Abstract

Test-time scaling (TTS) for large language models (LLMs) has thus far fallen into two largely separate paradigms: (1) reinforcement learning (RL) methods that optimize sparse outcome-based rewards, yet suffer from instability and low sample efficiency; and (2) search-based techniques guided by independently trained, static process reward models (PRMs), which require expensive human- or LLM-generated labels and often degrade under distribution shifts. In this paper, we introduce AIRL-S, the first natural unification of RL-based and search-based TTS. Central to AIRL-S is the insight that the reward function learned during RL training inherently represents the ideal PRM for guiding downstream search. Specifically, we leverage adversarial inverse reinforcement learning (AIRL) combined with group relative policy optimization (GRPO) to learn a dense, dynamic PRM directly from correct…
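For orientation, the two named ingredients can be written in their standard textbook forms. This is a sketch of the general AIRL discriminator and the GRPO group-normalized advantage, not necessarily the exact formulation used in AIRL-S; the symbols $f_\phi$, $\pi_\theta$, $G$, and $r_i$ are notation assumed here rather than taken from the abstract.

```latex
% Standard AIRL discriminator: f_\phi is the learned reward estimator,
% \pi_\theta is the current policy.
D_\phi(s, a) = \frac{\exp f_\phi(s, a)}{\exp f_\phi(s, a) + \pi_\theta(a \mid s)}

% Reward recovered from the discriminator.
\hat{r}(s, a) = \log D_\phi(s, a) - \log\bigl(1 - D_\phi(s, a)\bigr)
             = f_\phi(s, a) - \log \pi_\theta(a \mid s)

% GRPO advantage for the i-th of G sampled responses with rewards r_1, \dots, r_G.
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}
```

Because the reward is defined at the level of individual state-action pairs, it is dense in the sense the abstract describes, which is what lets the same function score intermediate reasoning steps during search.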
