PARM: Pipeline-Adapted Reward Model

Xingyu Fan; Wei Shao; Jiacheng Liu; Linqi Song; Pheng Ann Heng

arXiv:2604.18327 · cs.AI · April 21, 2026

TL;DR

This paper introduces PARM, a reward model tailored for multi-stage LLM pipelines, improving alignment and output quality in complex tasks like code generation and optimization.

Contribution

The paper proposes PARM, a novel reward modeling approach that adapts to pipeline-specific data and feedback, enhancing multi-stage LLM pipeline performance.

Findings

01 PARM improves execution rate and accuracy on optimization benchmarks.

02 PARM enhances stability and transferability across domains.

03 Pipeline-specific reward models outperform traditional single-step models.

Abstract

Reward models (RMs) are central to aligning large language models (LLMs) with human preferences, powering RLHF and advanced decoding strategies. While most prior work focuses on single-step generation, real-world applications increasingly adopt multi-stage LLM pipelines, where effective reward guidance remains underexplored. We investigate this through code generation for combinatorial optimization, constructing a pipeline that integrates reward models into both formulation and solution stages. We identify a critical challenge: inconsistency between reward model predictions and actual pipeline execution outcomes. To address this, we propose the Pipeline-Adapted Reward Model (PARM), which leverages pipeline-specific data and direct preference optimization to align rewards with downstream feedback. We instantiate PARM as a two-stage pipeline (formulation -> code generation) and evaluate…
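
To make the idea of aligning rewards with downstream feedback more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each pipeline output is labeled with two execution outcomes (whether the generated code ran, and whether it reached the correct answer), that preference pairs are formed by ranking candidates on those outcomes, and that the pairs are scored with the standard direct preference optimization (DPO) loss. The `Candidate` schema, the `outcome_score` ranking, and the value `beta=0.1` are all hypothetical; the abstract does not specify how PARM constructs its pairs or configures training.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass
class Candidate:
    """One pipeline output plus its observed downstream result (hypothetical schema)."""
    text: str
    executed: bool  # did the generated code run without error?
    correct: bool   # did it produce the expected answer?


def outcome_score(c: Candidate) -> int:
    """Assumed ranking: correct > executed but wrong > failed to execute."""
    if c.executed and c.correct:
        return 2
    return 1 if c.executed else 0


def build_preference_pairs(candidates):
    """Return (chosen, rejected) pairs where chosen has the strictly better outcome."""
    return [
        (a, b)
        for a, b in product(candidates, repeat=2)
        if outcome_score(a) > outcome_score(b)
    ]


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # Numerically stable softplus(-x), which equals -log(sigmoid(x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Illustrative data only: three candidate formulations for the same problem.
candidates = [
    Candidate("formulation A", executed=True, correct=True),
    Candidate("formulation B", executed=True, correct=False),
    Candidate("formulation C", executed=False, correct=False),
]
pairs = build_preference_pairs(candidates)
for chosen, rejected in pairs:
    print(f"prefer {chosen.text!r} over {rejected.text!r}")
```

In this sketch the preference signal comes entirely from what the pipeline actually did, rather than from a generic judgment of text quality, which is the inconsistency the abstract says PARM is meant to address. In practice the log-probabilities passed to `dpo_loss` would come from the model being trained and a frozen reference model.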
