Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions

Simon Matrenok; Skander Moalla; Caglar Gulcehre

arXiv:2507.08068 · cs.LG · December 2, 2025

TL;DR

QRPO is an alignment method that learns from pointwise absolute rewards using offline or off-policy data. It replaces raw rewards with quantile rewards, which makes the partition function of the KL-regularized RL objective analytically tractable, and it outperforms DPO, REBEL, and SimPO on chat and coding tasks.

Contribution

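QRPO bridges the gap between online, on-policy methods that learn from pointwise absolute rewards (PPO, GRPO) and simpler offline methods that are limited to preference pairs or relative signals (DPO, REBEL). Quantile rewards give an exact partition function, so the policy can be fit by pointwise regression, and the quantile estimates improve as more compute is spent on pre-computation.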

Findings

01

QRPO outperforms DPO, REBEL, and SimPO on chat, coding, and benchmark tasks.

02

QRPO scales with the compute spent on estimating quantile rewards, adding a pre-computation scaling dimension.

03

Training with robust rewards reduces length bias in language models.

Abstract

Aligning large language models with pointwise absolute rewards has so far required online, on-policy algorithms such as PPO and GRPO. In contrast, simpler methods that can leverage offline or off-policy data, such as DPO and REBEL, are limited to learning from preference pairs or relative signals. To bridge this gap, we introduce Quantile Reward Policy Optimization (QRPO), which learns from pointwise absolute rewards while preserving the simplicity and offline applicability of DPO-like methods. QRPO uses quantile rewards to enable regression to the closed-form solution of the KL-regularized RL objective. This reward yields an analytically tractable partition function, removing the need for relative signals to cancel this term. Moreover, QRPO scales with increased compute to estimate quantile rewards, opening a new dimension for pre-computation scaling. Empirically, QRPO consistently…
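The tractable partition function mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the quantile reward of a completion is the empirical CDF of its raw reward among completions sampled from the reference policy, so that quantile rewards are approximately uniform on [0, 1] under that policy. Under this assumption, the partition function Z = E[exp(q / beta)] of the KL-regularized objective reduces to the integral of exp(q / beta) over [0, 1], which equals beta * (exp(1 / beta) - 1). The function names, the Gaussian raw rewards, and the value of beta are hypothetical and are not taken from the paper or its code.

```python
import bisect
import math
import random


def quantile_reward(reward, reference_rewards):
    """Empirical CDF of `reward` among rewards of reference-policy samples."""
    ordered = sorted(reference_rewards)
    # Fraction of reference rewards that are <= reward.
    return bisect.bisect_right(ordered, reward) / len(ordered)


def exact_partition(beta):
    """Closed form of E[exp(q / beta)] when q is uniform on [0, 1]."""
    return beta * (math.exp(1.0 / beta) - 1.0)


def monte_carlo_partition(beta, num_samples, rng):
    """Monte Carlo estimate of the same expectation, as a sanity check."""
    total = sum(math.exp(rng.random() / beta) for _ in range(num_samples))
    return total / num_samples


rng = random.Random(0)
beta = 0.5  # hypothetical KL regularization coefficient

# Hypothetical raw rewards of reference completions for one prompt.
reference_rewards = [rng.gauss(0.0, 1.0) for _ in range(1000)]

q = quantile_reward(0.0, reference_rewards)
z_exact = exact_partition(beta)
z_mc = monte_carlo_partition(beta, 200_000, rng)
print(q, z_exact, z_mc)
```

In this sketch, drawing more reference samples per prompt gives a finer estimate of the quantile reward, which is one way to read the abstract's remark about pre-computation scaling. The paper's actual estimator and loss may differ in detail.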

Videos

Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions (SlidesLive)