PRL: Process Reward Learning Improves LLMs' Reasoning Ability and Broadens the Reasoning Boundary
Jiarui Yao, Ruida Wang, Tong Zhang

TL;DR
PRL introduces a theoretically grounded method for fine-grained process supervision in reinforcement learning, significantly enhancing LLMs' reasoning capabilities and expanding their reasoning boundaries.
Contribution
The paper proposes Process Reward Learning (PRL), a framework that decomposes the entropy-regularized RL objective into intermediate steps, each with a theoretically derived process reward, to improve LLM reasoning.
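For orientation, one common form of a regularized RL objective for an autoregressive policy, and the way its regularizer splits across generation steps, can be written as below. This is a sketch under assumptions: the KL-to-reference form of the regularizer and the symbols \pi_\theta, \pi_{\mathrm{ref}}, r, \beta, and the step index t are illustrative and not taken from the paper, whose exact objective and process reward may differ.

```latex
\begin{aligned}
J(\pi_\theta)
  &= \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big]
     - \beta\, \mathbb{E}_{x \sim \mathcal{D}}\Big[ \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big] \\
  &= \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\Big[\, r(x, y)
     - \beta \sum_{t=1}^{|y|} \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})} \,\Big]
\end{aligned}
```

The second line follows from the chain rule for autoregressive models, so the regularizer already decomposes into per-step terms. Only the outcome reward r(x, y) is defined at the trajectory level, which is the part a process-reward method has to distribute across intermediate steps.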
Findings
PRL improves average LLM reasoning performance, as measured by average@n.
PRL broadens the reasoning boundary, as shown by pass@n.
Extensive experiments confirm PRL's effectiveness and generalizability.
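The two metrics named above can be illustrated with a minimal sketch. It assumes the usual reading of these names: average@n is the mean accuracy over n sampled answers per problem, and pass@n is the fraction of problems solved by at least one of the n samples. The function names and the toy data are hypothetical, and the paper's exact evaluation protocol is not given in this summary.

```python
from typing import Sequence


def average_at_n(results: Sequence[Sequence[bool]]) -> float:
    """Mean per-sample accuracy for each problem, averaged over problems."""
    per_problem = [sum(samples) / len(samples) for samples in results]
    return sum(per_problem) / len(per_problem)


def pass_at_n(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems with at least one correct sample among the n drawn."""
    solved = [any(samples) for samples in results]
    return sum(solved) / len(solved)


# Toy data (hypothetical): 3 problems, n = 4 sampled answers each.
# True means the sampled answer was judged correct.
toy_results = [
    [True, False, False, True],
    [False, False, False, False],
    [False, True, False, False],
]

avg = average_at_n(toy_results)
passed = pass_at_n(toy_results)
print(f"average@4 = {avg:.3f}, pass@4 = {passed:.3f}")
```

On this toy data, average@4 reflects how often a single sample is right, while pass@4 counts a problem as solved if any sample is right, which is why it is read as a measure of the reasoning boundary.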
Abstract
Improving the reasoning abilities of Large Language Models (LLMs) has been an active research topic recently. However, most related work relies on outcome rewards at the trajectory level and misses fine-grained supervision during the reasoning process. Existing training frameworks that try to incorporate process signals when optimizing LLMs also rely heavily on tedious additional steps, such as MCTS or training a separate reward model, which harms training efficiency. Moreover, the intuition behind the design of these process signals lacks rigorous theoretical support, leaving the optimization mechanism opaque. In this paper, we propose Process Reward Learning (PRL), which decomposes the entropy-regularized reinforcement learning objective into intermediate steps, with rigorous process rewards that can be assigned to models accordingly. Starting from theoretical…
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks
