Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
Rachel Ma, Dylan Hadfield-Menell, Kristjan Greenewald

TL;DR
This paper introduces a calibration method for Process Reward Models (PRMs) based on conditional optimal transport, improving both confidence estimates and downstream performance on mathematical reasoning benchmarks.
Contribution
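Applies conditional optimal transport to PRM calibration, which the authors describe as the first such use to their knowledge. The learned conditional quantile function is monotonic, so its quantile estimates are structurally valid, and confidence bounds can be extracted at arbitrary levels.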
Findings
Significantly improves calibration of PRMs on reasoning benchmarks.
Enhances downstream performance in instance-adaptive scaling.
Offers a principled approach with structural guarantees for PRM calibration.
Abstract
Inference-time scaling methods rely on Process Reward Models (PRMs), which are often poorly calibrated and overestimate success probabilities. We propose, to our knowledge, the first use of conditional optimal transport for calibrating PRMs, modifying conditional OT (CondOT) map learning \cite{bunne2022supervised} to estimate a monotonic conditional quantile function over success probabilities estimated by the PRM, conditioned on PRM hidden states. This yields structurally valid quantile estimates and enables efficient extraction of confidence bounds at arbitrary levels, which we integrate into the instance-adaptive scaling (IAS) framework of \cite{park2025know}. We evaluate on mathematical reasoning benchmarks spanning moderate-difficulty problems (MATH-500) and harder out-of-distribution problems (AIME). For PRMs with reliable ranking signals, our method substantially improves…
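To make the idea of a monotonic conditional quantile function concrete, here is a minimal sketch in Python. It is not the authors' CondOT map: it swaps in a much simpler construction (a piecewise-linear quantile curve built from strictly positive increments) purely for illustration. It relies on the standard fact that in one dimension the optimal transport map from a uniform source to a target distribution is that distribution's quantile function. The function name `toy_conditional_quantile`, the weights `W` and `b`, and the random `hidden` vector are all hypothetical stand-ins, and nothing here is trained.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)); output is always >= 0.
    return np.logaddexp(0.0, x)

def toy_conditional_quantile(hidden, W, b, levels):
    """Toy quantile function Q(level | hidden) for a success probability in [0, 1].

    Assumption: this is an illustrative stand-in, not the CondOT map from the paper.
    Monotonicity in `level` holds by construction because the knot values are a
    cumulative sum of strictly positive increments.
    """
    logits = hidden @ W + b                    # context-dependent logits, shape (K,)
    increments = softplus(logits) + 1e-8       # strictly positive bin heights
    knots = np.concatenate(([0.0], np.cumsum(increments)))
    knots = knots / knots[-1]                  # rescale so Q(0) = 0 and Q(1) = 1
    grid = np.linspace(0.0, 1.0, knots.size)   # evenly spaced quantile levels
    return np.interp(levels, grid, knots)      # piecewise-linear interpolation

rng = np.random.default_rng(0)
hidden = rng.normal(size=16)     # stand-in for a PRM hidden state
W = rng.normal(size=(16, 8))     # hypothetical, untrained weights
b = np.zeros(8)

levels = np.linspace(0.0, 1.0, 11)
q = toy_conditional_quantile(hidden, W, b, levels)        # full quantile curve
lower_10 = toy_conditional_quantile(hidden, W, b, 0.1)    # 10% lower confidence bound
```

Because the curve is non-decreasing in the level for every `hidden` input, quantiles cannot cross, and a bound at any level is a single function evaluation. That is the kind of structural validity the abstract refers to, although the paper obtains it through a different parameterization.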
