Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions
Simon Matrenok, Skander Moalla, Caglar Gulcehre

TL;DR
QRPO is an offline alignment method that learns from pointwise absolute rewards. Transforming rewards into quantile rewards makes the partition function exact and analytically tractable, and the method outperforms DPO, REBEL, and SimPO on chat and coding tasks.
Contribution
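QRPO bridges the gap between learning from pointwise absolute rewards and simple offline methods. It uses quantile rewards, whose partition function is known exactly, so no relative signal is needed to cancel that term, and the quantile estimates improve as more compute is spent on them.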
Findings
QRPO outperforms DPO, REBEL, and SimPO on chat, coding, and benchmark tasks.
QRPO scales with the compute spent on estimating quantile rewards.
Training with robust rewards reduces length bias in language models.
Abstract
Aligning large language models with pointwise absolute rewards has so far required online, on-policy algorithms such as PPO and GRPO. In contrast, simpler methods that can leverage offline or off-policy data, such as DPO and REBEL, are limited to learning from preference pairs or relative signals. To bridge this gap, we introduce Quantile Reward Policy Optimization (QRPO), which learns from pointwise absolute rewards while preserving the simplicity and offline applicability of DPO-like methods. QRPO uses quantile rewards to enable regression to the closed-form solution of the KL-regularized RL objective. This reward yields an analytically tractable partition function, removing the need for relative signals to cancel this term. Moreover, QRPO scales with increased compute to estimate quantile rewards, opening a new dimension for pre-computation scaling. Empirically, QRPO consistently…
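The mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on stated assumptions: the quantile reward of a completion is estimated as the empirical CDF of its reward among rewards of reference-policy samples for the same prompt; that quantile is approximately Uniform(0, 1) under the reference policy (exact only for continuous reward distributions), so the partition function is Z = β(exp(1/β) − 1); and the regression target follows from the closed-form solution π*(y|x) = π_ref(y|x) exp(R/β) / Z. The function names `quantile_reward`, `log_partition`, and `qrpo_style_loss` are hypothetical.

```python
import math

import numpy as np


def quantile_reward(reward, ref_rewards):
    """Empirical CDF of `reward` among rewards of reference-policy samples.

    Assumption: `ref_rewards` holds rewards of completions sampled from the
    reference policy for the same prompt. More samples give a finer estimate.
    """
    ref_rewards = np.asarray(ref_rewards, dtype=float)
    return float(np.mean(ref_rewards <= reward))


def log_partition(beta):
    """log Z for a Uniform(0, 1) reward: Z = beta * (exp(1 / beta) - 1).

    Written as log(beta) + a + log(1 - exp(-a)) with a = 1 / beta so that
    small beta does not overflow.
    """
    a = 1.0 / beta
    return math.log(beta) + a + math.log1p(-math.exp(-a))


def qrpo_style_loss(q_reward, logp_policy, logp_ref, beta):
    """Squared error between the shifted quantile reward and the scaled log-ratio.

    Assumed form: (R_q - beta * log Z - beta * log(pi / pi_ref)) ** 2, which is
    zero when the policy matches the closed-form KL-regularized optimum.
    """
    target = q_reward - beta * log_partition(beta)
    return (target - beta * (logp_policy - logp_ref)) ** 2


if __name__ == "__main__":
    rq = quantile_reward(0.5, [0.1, 0.2, 0.6, 0.9])
    print("quantile reward:", rq)
    print("loss:", qrpo_style_loss(rq, logp_policy=-12.0, logp_ref=-12.5, beta=0.1))
```

Because Z depends only on β, a single completion with a single scalar reward is enough to form the loss. No paired completion is needed to cancel the partition function.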
Code & Models
- skandermoalla/qrpo-paper-llama-nosft-leetcode-sandbox-temp1-ref50-offline-sandboxdataset · 18 dl
- skandermoalla/qrpo-paper-llama-nosft-leetcode-sandbox-temp1-ref50-offpolicy10random-sandboxdataset · 11 dl
- skandermoalla/qrpo-paper-llama-sft-leetcode-sandbox-temp1-ref50-offline-sandboxdataset · 15 dl
- skandermoalla/qrpo-paper-llama-sft-leetcode-sandbox-temp1-ref50-offpolicy10random-sandboxdataset · 13 dl
- skandermoalla/qrpo-paper-llama-nosft-magpieair-armorm-temp1-ref50-offline-armormdataset · 22 dl
