From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment

Bin Xie; Bingbing Xu; Yige Yuan; Shengmao Zhu; Huawei Shen

arXiv:2506.12446·cs.CL·July 1, 2025

TL;DR

This paper brings process reward models (PRMs) into reward-guided search to improve inference-time alignment of large language models. By keeping the scores of partial and complete responses consistent, it aligns generated responses more closely with human preferences.

Contribution

The paper proposes SP-PRM, a dual-consistency framework that integrates score and preference consistency in process reward models without human annotation, addressing the granularity mismatch in existing methods.

Findings

01. SP-PRM improves GPT-4 evaluation scores by 3.6%-10.3% across tasks.

02. Extensive experiments on dialogue, summarization, and reasoning tasks validate its effectiveness.

03. SP-PRM addresses the granularity mismatch between outcome and process rewards in reward-guided search (RGS) methods.

Abstract

Inference-time alignment methods have gained significant attention for their efficiency and effectiveness in aligning large language models (LLMs) with human preferences. However, existing dominant approaches using reward-guided search (RGS) primarily rely on outcome reward models (ORMs), which suffer from a critical granularity mismatch: ORMs are designed to provide outcome rewards for complete responses, while RGS methods rely on process rewards to guide the policy, leading to inconsistent scoring and suboptimal alignment. To address this challenge, we introduce process reward models (PRMs) into RGS and argue that an ideal PRM should satisfy two objectives: Score Consistency, ensuring coherent evaluation across partial and complete responses, and Preference Consistency, aligning partial sequence assessments with human preferences. Based on these, we propose SP-PRM, a novel…
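
The granularity mismatch described in the abstract can be seen in a minimal sketch of reward-guided search. This is not the paper's SP-PRM method: the function names `reward_guided_search`, `propose`, and `score`, the chunk-level greedy selection, and the toy stand-ins below are all assumptions made for illustration. The point it shows is that the search loop queries the reward model on partial responses at every step, even though an ORM is designed to score only complete ones.

```python
from typing import Callable, List


def reward_guided_search(
    prompt: str,
    propose: Callable[[str], List[str]],  # hypothetical: candidate next chunks for a prefix
    score: Callable[[str, str], float],   # hypothetical: reward for (prompt, partial response)
    max_steps: int,
) -> str:
    """Greedy chunk-level search: keep the candidate whose partial response scores highest."""
    response = ""
    for _ in range(max_steps):
        candidates = propose(prompt + response)
        if not candidates:
            break
        # The reward model is called on a PARTIAL response here. With an ORM,
        # this is the point where the granularity mismatch arises.
        best = max(candidates, key=lambda chunk: score(prompt, response + chunk))
        response += best
    return response


# Toy stand-ins (assumptions, not a real policy or reward model).
def toy_propose(prefix: str) -> List[str]:
    return [" good", " bad", " fine"]


def toy_score(prompt: str, partial: str) -> float:
    return float(partial.count("good") - partial.count("bad"))


result = reward_guided_search("Q:", toy_propose, toy_score, max_steps=3)
print(repr(result))
```

In this sketch, swapping `toy_score` for a model trained only on complete responses would leave the loop unchanged, which is why the consistency of scores on partial sequences matters for the quality of the final response.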

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Multimodal Machine Learning Applications