PSPO*: An Effective Process-supervised Policy Optimization for Reasoning Alignment
Jiawei Li, Xinyue Liang, Junlong Zhang, Yizhe Yang, Chong Feng, Yang Gao

TL;DR
This paper introduces PSPO*, a new process supervision framework for reasoning tasks in large language models, emphasizing nonlinear reward shaping to improve reasoning accuracy and reduce errors.
Contribution
The paper proposes PSPO*, a novel process supervision paradigm that incorporates nonlinear reward functions based on reasoning steps, enhancing reasoning performance.
Findings
PSPO-WRS outperforms existing models on six reasoning datasets.
Nonlinear reward shaping improves reasoning accuracy.
Considering reasoning steps in reward design benefits model performance.
Abstract
Process supervision enhances the performance of large language models in reasoning tasks by providing feedback at each step of chain-of-thought reasoning. However, due to the lack of effective process supervision methods, even advanced large language models are prone to logical errors and redundant reasoning. We claim that the effectiveness of process supervision significantly depends on both the accuracy and the length of reasoning chains. Moreover, we identify that these factors exhibit a nonlinear relationship with the overall reward score of the reasoning process. Inspired by these insights, we propose a novel process supervision paradigm, PSPO*, which systematically outlines the workflow from reward model training to policy optimization, and highlights the importance of nonlinear rewards in process supervision. Based on PSPO*, we develop the PSPO-WRS, which considers the number of…
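Illustration
The abstract's claim that the reward depends nonlinearly on both step accuracy and chain length can be pictured with a small sketch. This is not the paper's formula: the truncated abstract does not specify the shaping function, so the Weibull-shaped length weight, the parameter values, and the names `length_factor` and `shaped_reward` are all assumptions chosen for illustration.

```python
import math


def length_factor(num_steps: int, shape: float = 2.0, scale: float = 5.0) -> float:
    """Hypothetical length weight: a Weibull PDF rescaled so its peak equals 1.

    Chains that are very short or very long get a weight near 0; chains
    near the curve's mode get a weight near 1. `shape` and `scale` are
    illustrative values, not taken from the paper.
    """
    if shape <= 1:
        raise ValueError("shape must be > 1 so the curve has an interior peak")
    if num_steps <= 0:
        return 0.0

    def pdf(x: float) -> float:
        return (shape / scale) * (x / scale) ** (shape - 1) * math.exp(-((x / scale) ** shape))

    # Mode of the Weibull distribution (valid for shape > 1).
    mode = scale * ((shape - 1) / shape) ** (1 / shape)
    return pdf(num_steps) / pdf(mode)


def shaped_reward(step_scores: list[float], shape: float = 2.0, scale: float = 5.0) -> float:
    """Combine mean per-step accuracy with the nonlinear length weight.

    `step_scores` are assumed to be per-step correctness scores in [0, 1],
    such as a process reward model might produce.
    """
    if not step_scores:
        return 0.0
    accuracy = sum(step_scores) / len(step_scores)
    return accuracy * length_factor(len(step_scores), shape, scale)


if __name__ == "__main__":
    concise = [1.0, 1.0, 1.0, 1.0]   # 4 correct steps
    redundant = [1.0] * 20           # 20 correct but redundant steps
    flawed = [0.5, 0.5, 0.5, 0.5]    # 4 steps, half as accurate
    print(shaped_reward(concise), shaped_reward(redundant), shaped_reward(flawed))
```

Under these assumed parameters, a concise and fully correct chain scores higher than an equally correct but much longer one, and higher than a chain of the same length with less accurate steps, which is the qualitative behavior the abstract describes.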
Taxonomy
Topics: Business Process Modeling and Analysis
