TL;DR
ReasonFlux-PRM is a trajectory-aware process reward model (PRM) that scores the reasoning traces of large language models using both step-level and trajectory-level supervision, which improves training-data selection and performance on reasoning tasks.
Contribution
This work introduces ReasonFlux-PRM, a novel trajectory-aware PRM that evaluates reasoning traces and supports reward supervision in offline and online settings, improving model training and inference.
Findings
ReasonFlux-PRM-7B selects higher-quality data than strong PRMs and human-curated baselines.
It achieves an average gain of 12.1% in supervised fine-tuning.
It yields average gains of 4.5% in reinforcement learning and 6.3% in test-time scaling.
Abstract
Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on model final output responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory-response outputs generated by frontier reasoning models like DeepSeek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for…
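To make the offline data-selection idea concrete, here is a minimal sketch of ranking trajectory-response samples by a combined reward. It assumes the step-level and trajectory-level rewards have already been produced by some PRM. The `ScoredSample` container, the `alpha` weight, and the "mean step reward plus weighted trajectory reward" aggregation are illustrative assumptions, not the paper's actual formulation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredSample:
    """Hypothetical container: one trajectory-response pair with PRM rewards attached."""
    sample_id: str
    step_rewards: List[float]   # one reward per reasoning step
    trajectory_reward: float    # one reward for the trace as a whole


def aggregate_reward(sample: ScoredSample, alpha: float = 0.5) -> float:
    """Assumed aggregation: mean step reward plus alpha times the trajectory reward."""
    if not sample.step_rewards:
        raise ValueError("sample has no step rewards")
    mean_step = sum(sample.step_rewards) / len(sample.step_rewards)
    return mean_step + alpha * sample.trajectory_reward


def select_top_k(samples: List[ScoredSample], k: int, alpha: float = 0.5) -> List[ScoredSample]:
    """Keep the k samples with the highest aggregated reward."""
    ranked = sorted(samples, key=lambda s: aggregate_reward(s, alpha), reverse=True)
    return ranked[:k]


# Made-up rewards, purely for illustration.
samples = [
    ScoredSample("a", [0.9, 0.8, 0.7], 0.6),
    ScoredSample("b", [0.2, 0.4], 0.9),
    ScoredSample("c", [0.5, 0.5, 0.5, 0.5], 0.2),
]
selected_ids = [s.sample_id for s in select_top_k(samples, k=2)]
```

Under the same assumptions, calling `select_top_k` with `k=1` over N candidate responses to a single prompt gives a simple Best-of-N selection, one common form of test-time scaling.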
