RewardRank: Optimizing True Learning-to-Rank Utility

Gaurav Bhatt; Kiran Koshy Thekumparampil; Tanmay Gangwani; Tesi Xiao; Leonid Sigal

arXiv:2508.14180·cs.IR·October 21, 2025

Open Access 3 Reviews

TL;DR

RewardRank is a data-driven learning-to-rank framework that directly optimizes true user utility from logged interactions, outperforming methods that optimize proxy objectives or classical metrics such as NDCG.

Contribution

It introduces RewardRank, a new framework for counterfactual utility maximization in ranking, along with two benchmark suites for rigorous evaluation.

Findings

01

RewardRank achieves the highest counterfactual utility on both benchmarks.

02

Optimizing classical metrics like NDCG is sub-optimal for true utility.

03

RewardRank sets a new state-of-the-art in offline relevance performance.
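For readers unfamiliar with the metric named in Finding 02, the following is a minimal sketch of NDCG. It assumes the common exponential-gain, log2-discount variant; this page does not state which variant the paper uses, and the function names `dcg` and `ndcg` are illustrative only.

```python
import numpy as np

def dcg(rels):
    """Discounted cumulative gain for relevance labels given in ranked order.

    Assumed variant: gain 2^rel - 1, discount log2(position + 1).
    """
    rels = np.asarray(rels, dtype=float)
    # Positions are 1-indexed, so the discounts are log2(2), log2(3), ...
    discounts = np.log2(np.arange(2, rels.size + 2))
    return float(np.sum((2.0 ** rels - 1.0) / discounts))

def ndcg(rels_in_ranked_order):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(rels_in_ranked_order, reverse=True))
    return dcg(rels_in_ranked_order) / ideal if ideal > 0 else 0.0
```

NDCG depends only on per-item relevance labels and a fixed position discount, so it cannot by itself express list-level effects such as the item interactions the reviewers discuss.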

Abstract

Traditional ranking systems optimize offline proxy objectives that rely on oversimplified assumptions about user behavior, often neglecting factors such as position bias and item diversity. Consequently, these models fail to improve true counterfactual utilities, such as click-through rate or purchase probability, when evaluated in online A/B tests. We introduce RewardRank, a data-driven learning-to-rank (LTR) framework for counterfactual utility maximization. RewardRank first learns a reward model that predicts the utility of any ranking directly from logged user interactions, and then trains a ranker to maximize this reward using a differentiable soft permutation operator. To enable rigorous and reproducible evaluation, we further propose two benchmark suites: (i) Parametric Oracle Evaluation (PO-Eval), which employs an open-source click model as a counterfactual oracle on the…
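The abstract does not name the soft permutation operator, but Reviewer 02 below identifies it as SoftSort. As a minimal sketch, assuming the standard SoftSort definition (a row-wise softmax over negative absolute differences between sorted and unsorted scores), the operator can be written as follows. The function name `softsort` and the temperature parameter `tau` are illustrative choices, not taken from the paper.

```python
import numpy as np

def softsort(scores, tau=1.0):
    """Relaxed permutation matrix for sorting `scores` in descending order.

    Row i is a probability distribution over items for rank i.
    As tau -> 0, each row approaches a one-hot vector (a hard permutation).
    """
    s = np.asarray(scores, dtype=float)
    sorted_s = np.sort(s)[::-1]  # scores in descending order
    # logits[i, j] = -|i-th largest score - score of item j| / tau
    logits = -np.abs(sorted_s[:, None] - s[None, :]) / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Hypothetical ranker scores for three items.
P = softsort([0.2, 2.0, -1.0], tau=0.1)
```

Because every entry of the matrix is a smooth function of the scores, a reward computed on the softly permuted item list can be backpropagated to the ranker. The matrix has one row and one column per item, which is the quadratic cost that Reviewer 03's scalability concern touches on.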

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The proposed approach can account for interactions among items (positions) through the attention mechanism of the reward model (although it encodes only linear interactions).
- The two benchmark setups used in the experiments should be useful for other ranking papers and for the community. In particular, demonstrating that expected reward and NDCG can diverge is a useful takeaway. The results also show that the proposed method works well on these benchmarks.
- The related

Weaknesses

- My main concern is whether the proposed ranker loss adequately addresses the distribution shift issue. As I understand it, the discounting weight ($|u_i - \hat{u}_i|$) is calculated on the observed samples, regardless of the soft-ranked results (i.e., regardless of the choice of the optimized model). Moreover, downweighting uncertain samples seems to work well only when the logged data contains high reward values, as in imitation learning. Clarification on this point would be useful.
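The reviewer's point can be illustrated with a small sketch. The review gives only the residual $|u_i - \hat{u}_i|$, not the functional form of the weight or of the loss, so the form $1 / (1 + |u_i - \hat{u}_i|)$ and the names `residual_weights` and `weighted_ranker_loss` below are assumptions made purely for illustration, not the paper's method.

```python
import numpy as np

def residual_weights(u_logged, u_hat_logged):
    """Per-sample weights from the reward model's error on LOGGED rankings.

    ASSUMED form: w_i = 1 / (1 + |u_i - u_hat_i|). The review does not
    state the actual form used in the paper.
    """
    u = np.asarray(u_logged, dtype=float)
    u_hat = np.asarray(u_hat_logged, dtype=float)
    return 1.0 / (1.0 + np.abs(u - u_hat))

def weighted_ranker_loss(u_logged, u_hat_logged, reward_of_soft_ranking):
    """Negative weighted mean of predicted reward on the ranker's soft rankings.

    The weights are computed from logged data only; they do not change
    when the ranker (and hence the soft ranking) changes.
    """
    w = residual_weights(u_logged, u_hat_logged)
    r = np.asarray(reward_of_soft_ranking, dtype=float)
    return float(-np.mean(w * r))
```

In this sketch the weights are constants with respect to the ranker, which matches the reviewer's description: a ranking far from the logged one receives the same weight as a ranking close to it.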

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The studied problem is well-motivated. Utility optimization is meaningful in many realistic scenarios.
2. Differentiable permutation modeling with SoftSort is nicely integrated and technically sound.
3. The attempt to create a reproducible counterfactual evaluation is valuable for the community.
4. The writing and presentation are good.

Weaknesses

1. The problem formulation and proposed approach have limited novelty. Training a reward model to estimate the reward of a ranking list in both search and recommendation is common, such as:
   * "Model-based unbiased learning to rank. D Luo, L Zou, Q Ai, Z Chen, D Yin, BD Davison".

   Optimizing towards an overall metric, given a list rather than pointwise evaluation of each result, has also been studied.
   * "Reinforcement Learning to Rank Using Coarse-grained Rewards." Tu, Yiteng; Xu, Zhichao; Yang,

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. Addresses the fundamental limitation of traditional LTR by directly optimizing for complex, list-level user utility, bypassing simplified proxy objectives and effectively capturing permutation-aware behavioral biases.
2. Introduction of PO-Eval and LAU-Eval (LLM-As-User) provides standardized, scalable, and automated benchmarks for counterfactual LTR assessment.

Weaknesses

1. The permutation-aware RewardRank framework, particularly the Reward Model and SoftSort operator, requires significant computational resources. Scalability remains a challenge for very large datasets or long item permutations.
2. Despite the goal of optimizing true utility, the paper lacks crucial real-world A/B test results to directly verify performance gains over baselines, confining its claims to offline/counterfactual assessment.
3. It appears to be missing comparisons against the lates

Taxonomy

Topics: Machine Learning and Algorithms · Machine Learning and Data Classification