PARM: Pipeline-Adapted Reward Model
Xingyu Fan, Wei Shao, Jiacheng Liu, Linqi Song, Pheng Ann Heng

TL;DR
This paper introduces PARM, a reward model tailored for multi-stage LLM pipelines, improving alignment and output quality in complex tasks like code generation and optimization.
Contribution
The paper proposes PARM, a reward model trained with direct preference optimization on pipeline-specific data so that its scores align with downstream execution feedback, improving the performance of multi-stage LLM pipelines.
Findings
PARM improves execution rate and accuracy on optimization benchmarks.
PARM enhances stability and transferability across domains.
Pipeline-specific reward models outperform traditional single-step models.
Abstract
Reward models (RMs) are central to aligning large language models (LLMs) with human preferences, powering RLHF and advanced decoding strategies. While most prior work focuses on single-step generation, real-world applications increasingly adopt multi-stage LLM pipelines, where effective reward guidance remains underexplored. We investigate this through code generation for combinatorial optimization, constructing a pipeline that integrates reward models into both formulation and solution stages. We identify a critical challenge: inconsistency between reward model predictions and actual pipeline execution outcomes. To address this, we propose the Pipeline-Adapted Reward Model (PARM), which leverages pipeline-specific data and direct preference optimization to align rewards with downstream feedback. We instantiate PARM as a two-stage pipeline (formulation -> code generation) and evaluate…
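The idea of aligning rewards with downstream feedback can be illustrated with a minimal sketch. It assumes that each candidate produced at a pipeline stage is labelled with the outcome of actually running the pipeline, and that preference pairs are formed by ranking candidates on that outcome. The `Candidate` fields, the `outcome_score` ranking rule, and the pairing scheme are hypothetical illustrations, not details taken from the paper. The loss is the standard DPO objective, computed here from precomputed sequence log-probabilities.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass
class Candidate:
    text: str       # a formulation or a piece of solver code
    executed: bool  # hypothetical flag: the downstream run finished without error
    correct: bool   # hypothetical flag: the run matched the reference answer


def outcome_score(c: Candidate) -> int:
    # Assumed ranking rule: correct > executed-but-wrong > failed to execute.
    return int(c.executed) + int(c.correct)


def build_preference_pairs(candidates):
    """Return (chosen, rejected) text pairs ordered by pipeline outcome."""
    pairs = []
    for a, b in product(candidates, repeat=2):
        if outcome_score(a) > outcome_score(b):
            pairs.append((a.text, b.text))
    return pairs


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    cands = [
        Candidate("formulation A", executed=True, correct=True),
        Candidate("formulation B", executed=True, correct=False),
        Candidate("formulation C", executed=False, correct=False),
    ]
    for chosen, rejected in build_preference_pairs(cands):
        print(chosen, ">", rejected)
```

Ranking on execution outcomes rather than on a generic quality judgment is what ties the preference data to the pipeline in this sketch. Any finer-grained signal, such as an optimality gap, could replace `outcome_score` without changing the rest.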
