The Bidirectional Process Reward Model
Lingyin Zhang, Jun Gao, Xiaoxue Ren, Ziqiang Cao

TL;DR
The paper introduces BiPRM, a bidirectional process reward model that scores reasoning steps in both left-to-right (L2R) and right-to-left (R2L) order, then fuses the two scores, to make reward assessments for large language models more accurate and robust.
Contribution
The paper proposes a bidirectional evaluation paradigm with a simple gating mechanism that fuses L2R and R2L reward scores. The gating module adds only 0.3% more parameters to the original PRM.
Findings
BiPRM outperforms unidirectional models across multiple benchmarks.
It achieves an average 10.6% relative gain in solution quality.
It demonstrates robustness and broad applicability across diverse settings.
Abstract
Process Reward Models (PRMs), which assign fine-grained scores to intermediate reasoning steps within a solution trajectory, have emerged as a promising approach to enhance the reasoning quality of Large Language Models (LLMs). However, most existing PRMs rely on a unidirectional left-to-right (L2R) evaluation scheme, which restricts their utilization of global context. In light of this challenge, we propose a novel bidirectional evaluation paradigm, named Bidirectional Process Reward Model (BiPRM). BiPRM incorporates a parallel right-to-left (R2L) evaluation stream, implemented via prompt reversal, alongside the conventional L2R flow. Then a gating mechanism is introduced to adaptively fuse the reward scores from both streams to yield a holistic quality assessment. Remarkably, compared to the original PRM, BiPRM introduces only a 0.3% parameter increase for the gating module, and the…
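The fusion step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the function names (`reverse_steps`, `fuse_scores`), the reading of "prompt reversal" as presenting the steps in last-to-first order, and the scalar sigmoid gate with hypothetical parameters `w_l`, `w_r`, and `bias`. The abstract states only that a gating mechanism adaptively fuses the two streams' scores.

```python
import math
from typing import List, Sequence


def reverse_steps(steps: Sequence[str]) -> List[str]:
    """Assumed form of prompt reversal: present the steps last-to-first."""
    return list(reversed(steps))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fuse_scores(
    l2r_scores: Sequence[float],
    r2l_scores_reversed: Sequence[float],
    w_l: float = 1.0,   # hypothetical gate weight for the L2R score
    w_r: float = 1.0,   # hypothetical gate weight for the R2L score
    bias: float = 0.0,  # hypothetical gate bias
) -> List[float]:
    """Fuse per-step scores from the two streams with a scalar gate.

    r2l_scores_reversed is given in reversed step order (as the R2L
    stream would produce it) and is realigned to the original order.
    """
    r2l_scores = list(reversed(r2l_scores_reversed))
    if len(l2r_scores) != len(r2l_scores):
        raise ValueError("both streams must score the same number of steps")

    fused = []
    for s_l, s_r in zip(l2r_scores, r2l_scores):
        # The gate lies in (0, 1), so each fused score is a convex
        # combination of the two stream scores for that step.
        gate = sigmoid(w_l * s_l + w_r * s_r + bias)
        fused.append(gate * s_l + (1.0 - gate) * s_r)
    return fused


if __name__ == "__main__":
    steps = ["Let x = 3.", "Then 2x = 6.", "So 2x + 1 = 7."]
    print(reverse_steps(steps))
    # Placeholder scores; a real system would obtain them from a PRM.
    print(fuse_scores([0.9, 0.8, 0.4], [0.2, 0.7, 0.95]))
```

Because the gate stays strictly between 0 and 1, each fused score falls between the two stream scores for that step. In a trained model the gate parameters would be learned rather than fixed as they are here.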
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education
