rePIRL: Learn PRM with Inverse RL for LLM Reasoning

Xian Wu; Kaijie Zhu; Ying Zhang; Lun Wang; Wenbo Guo

arXiv:2602.07832 · cs.LG · May 21, 2026

TL;DR

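rePIRL is an inverse-RL-inspired framework that learns process reward models (PRMs) for LLM reasoning with minimal assumptions about the expert policy. It targets two weaknesses of existing methods: reliance on strong assumptions (e.g., access to the expert's reward function) and intrinsic limitations such as entropy collapse.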

Contribution

The paper proposes a dual learning algorithm that updates the PRM and the policy in turn, requires minimal assumptions about expert policies, and unifies existing online and offline PRM learning methods under one framework.

Findings

1. rePIRL outperforms existing methods on math and coding reasoning datasets.
2. The learned PRM improves test-time training and scaling.
3. Ablation studies validate the effectiveness of key design choices.

Abstract

Process rewards have been widely used in deep reinforcement learning to improve training efficiency, reduce variance, and prevent reward hacking. In LLM reasoning, existing works also explore various solutions for learning effective process reward models (PRM) with or without the help of an expert policy. However, existing methods either rely on strong assumptions about the expert policies (e.g., requiring their reward functions) or suffer intrinsic limitations (e.g., entropy collapse), resulting in weak PRMs or limited generalizability. In this paper, we introduce rePIRL, an inverse RL-inspired framework that learns effective PRMs with minimal assumptions about expert policies. Specifically, we design a dual learning process that updates the policy and the PRM interchangeably. Our learning algorithm has customized techniques to address the challenges of scaling traditional inverse RL…
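The alternating policy/reward update described in the abstract can be illustrated with a toy maximum-entropy IRL loop. This is a minimal sketch under loud assumptions, not the paper's algorithm: the "task" is a single-step choice among three actions, the reward is a plain per-action table rather than a PRM, the policy step uses the closed-form entropy-regularized optimum (policy proportional to exp(reward)) instead of an RL update, and the names `alternate_irl`, `expert_probs`, and `lr_reward` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def alternate_irl(expert_probs, rounds=500, lr_reward=0.5):
    """Alternate a reward step and a policy step on a toy one-step task.

    expert_probs: empirical action frequencies of the expert (assumed given).
    Returns the learned per-action reward table and the final policy.
    """
    reward = [0.0] * len(expert_probs)
    policy = softmax(reward)  # start from the uniform policy
    for _ in range(rounds):
        # Reward step: the max-entropy IRL gradient raises reward where the
        # expert acts more often than the current policy and lowers it elsewhere.
        reward = [r + lr_reward * (e - p)
                  for r, e, p in zip(reward, expert_probs, policy)]
        # Policy step: closed-form best response to the current reward under
        # an entropy bonus (a stand-in for an actual policy-gradient update).
        policy = softmax(reward)
    return reward, policy

reward, policy = alternate_irl([0.6, 0.3, 0.1])
print("policy:", [round(p, 3) for p in policy])
print("reward gaps:", round(reward[0] - reward[1], 3), round(reward[1] - reward[2], 3))
```

In this toy setting the loop needs only the expert's behavior, never its reward function: the policy drifts toward the expert's action frequencies, and the learned reward gaps approach the log-ratios of those frequencies.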

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The paper makes a substantial contribution by successfully adapting and scaling classical IRL to the challenging domain of LLM reasoning. While IRL is well-established in robotics and control, its application to LLMs is novel and non-trivial due to the enormous state and action spaces involved. The theoretical unification of existing online and offline PRM methods under a single framework is both creative and insightful, clearly underscoring the minimal-assumption nature of rePIRL. The technical

Weaknesses

While the paper includes several important baselines, the empirical comparison could be further strengthened by incorporating a broader array of recent PRM or preference-based methods, particularly those that similarly operate under minimal supervision. Although the paper ablates the policy learning algorithm, it offers less insight into the PRM's architecture and specific design choices. Given that the reward model is a core component, an ablation study of its capacity or architectural variatio

Reviewer 02 · Rating 6 · Confidence 2

Strengths

1. The core contribution is the application of an IRL-inspired framework to learn PRMs. This is well-motivated, as it bypasses the need for explicit token-level reward annotations or access to an expert policy for MCTS, requiring only expert trajectories.
2. The paper provides a strong theoretical contribution by integrating several SOTA methods (DPO, DQO, MCTS, PRIME) into its framework as special cases that require additional assumptions. This analysis in Section 3.3 rigorously supports the c

Weaknesses

1. While rePIRL achieves SOTA average performance among the tested methods, the absolute improvements are modest. For example, on the math benchmarks, rePIRL achieves a 33.5% average, compared to 31.7% for vanilla RLOO and 30.7% for MCTS. These small margins raise questions about the practical utility of the method relative to its complexity.
2. The proposed dual learning algorithm is significantly more complex than the baselines. It requires simultaneously training a policy and a PRM, managin
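The margins quoted in weakness 1 can be checked with a few lines of arithmetic. The sketch below uses only the three math-benchmark averages stated by the reviewer; the helper name `margins` and the dictionary layout are illustrative assumptions.

```python
# Math-benchmark averages (percent) as quoted in the review.
scores = {"rePIRL": 33.5, "RLOO": 31.7, "MCTS": 30.7}

def margins(scores, method="rePIRL"):
    """Return {baseline: (absolute gain in points, relative gain in percent)}."""
    ours = scores[method]
    result = {}
    for name, value in scores.items():
        if name == method:
            continue
        absolute = ours - value
        result[name] = (round(absolute, 1), round(100 * absolute / value, 1))
    return result

for baseline, (points, relative) in margins(scores).items():
    print(f"vs {baseline}: +{points} points ({relative}% relative)")
```

The gains are 1.8 points over RLOO and 2.8 points over MCTS, which is roughly 5.7% and 9.1% in relative terms.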

Reviewer 03 · Rating 2 · Confidence 4

Strengths

* The proposed importance weighting method for the IRL algorithm is potentially a nice algorithmic trick.

Weaknesses

* The connection with other SOTA methods is a bit superfluous. The main commonality is just that all methods are more or less based on maximum entropy RL, where the connections are widely known.
* IRL for LLM fine-tuning has already been proposed by multiple papers, e.g. [1, 2]. Not citing or comparing with them leaves out a lot of context, especially since the proposed algorithm is very similar to [2].
* Considering the paper is algorithmic, it's missing some ablation experiments on the algorith

Taxonomy

Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Data Stream Mining Techniques