R1-Ranker: Teaching LLM Rankers to Reason

Tao Feng; Zhigang Hua; Zijie Lei; Yan Xie; Shuang Yang; Bo Long; Jiaxuan You

arXiv:2506.21638·cs.IR·October 17, 2025

TL;DR

This paper introduces R1-Ranker, a reinforcement learning framework that enhances large language models' reasoning abilities for ranking tasks across various domains, achieving state-of-the-art results and demonstrating the importance of iterative reasoning.

Contribution

The paper presents R1-Ranker, a novel reasoning-incentive framework with two designs, improving LLM ranking performance through reinforcement learning and iterative reasoning across multiple datasets.

Findings

01

IRanker-3B achieves state-of-the-art performance.

02

Over 15.7% average relative improvement across datasets.

03

Zero-shot out-of-domain performance improves by over 9%.

Abstract

Large language models (LLMs) have recently shown strong reasoning abilities in domains like mathematics, coding, and scientific problem-solving, yet their potential for ranking tasks, where prime examples include retrieval, recommender systems, and LLM routing, remains underexplored. Ranking requires complex reasoning across heterogeneous candidates, but existing LLM-based rankers are often domain-specific, tied to fixed backbones, and lack iterative refinement, limiting their ability to fully exploit LLMs' reasoning potential. To address these challenges, we propose R1-Ranker, a reasoning-incentive framework built on reinforcement learning, with two complementary designs: DRanker, which generates full rankings in one shot, and IRanker, which decomposes ranking into an iterative elimination process with step-wise rewards to encourage deeper reasoning. We evaluate unified R1-Rankers on…
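The iterative elimination idea behind IRanker can be illustrated with a minimal sketch. It is not the authors' implementation: `pick_worst` is a hypothetical stand-in for one LLM reasoning step, the toy `scores` table replaces the model, and the 0/1 step-wise reward is an assumed scheme chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def iterative_elimination_rank(
    candidates: Sequence[str],
    pick_worst: Callable[[List[str]], str],
) -> Tuple[List[str], List[str]]:
    """Rank by repeatedly removing the candidate judged least relevant.

    `pick_worst` is a hypothetical callable standing in for one LLM step.
    Returns (ranking from best to worst, elimination order).
    """
    pool = list(candidates)
    eliminated: List[str] = []
    while len(pool) > 1:
        worst = pick_worst(pool)
        if worst not in pool:
            raise ValueError(f"{worst!r} is not in the candidate pool")
        pool.remove(worst)
        eliminated.append(worst)
    # The survivor ranks first; candidates eliminated earlier rank lower.
    return pool + eliminated[::-1], eliminated


def stepwise_rewards(elimination_order: Sequence[str], positive: str) -> List[float]:
    """Assumed reward scheme: 1.0 per step that keeps the ground-truth item."""
    return [0.0 if item == positive else 1.0 for item in elimination_order]


# Toy relevance scores replace the LLM in this sketch.
scores: Dict[str, float] = {"a": 0.9, "b": 0.1, "c": 0.5}
ranking, eliminated = iterative_elimination_rank(
    list(scores), lambda pool: min(pool, key=lambda item: scores[item])
)
rewards = stepwise_rewards(eliminated, positive="a")
```

A one-shot ranker in the style of DRanker would instead emit the whole ordering in a single step, so it would receive feedback only once per ranking rather than once per elimination.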

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- The paper addresses an important topic that needs more exploration.
- The paper evaluates on several tasks and includes many baseline models.
- The paper is clear to read, if missing details

Weaknesses

I have two main concerns with this paper: (1) The evaluations (as well as training) are unclear and seem unconventional. I am not an expert in recommender systems but do have experience in passage ranking and the setup for MS MARCO is very unusual. I am also not sure how the training/test split is done: there are 9 datasets but it seems the model is trained on all of them and then tested on them as well? This runs counter to the paper's premise to create a model that generalizes, as all the te

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Applying the powerful general reasoning ability of LLMs to ranking is an important and cutting-edge research direction.
2. Optimizing LLM reasoning for ranking tasks via reinforcement learning is a promising technical approach.

Weaknesses

The discussion on model design and comparisons is insufficient. The relationship between DRanker and IRanker is unclear. Although presented as parallel methods, IRanker clearly outperforms DRanker, and all subsequent experiments are based on IRanker, making the necessity of DRanker uncertain. While the paper emphasizes reasoning, it does not clearly explain how the method improves the reasoning ability of LLMs, nor does it provide a detailed description of the modeling approach in the main text

Reviewer 03 · Rating 2 · Confidence 5

Strengths

1. The paper attempts to unify multiple ranking paradigms (recommendation, retrieval, and routing) under a reasoning-driven reinforcement learning framework. Introducing the iterative elimination design (IRanker) is a conceptually creative attempt to bridge “reasoning steps” and “listwise ranking decisions.”
2. The use of step-wise rewards and PPO-based optimization is methodologically sound and well-aligned with reinforcement learning for language models.
3. The paper is generally well-wr

Weaknesses

1. Benchmark Coverage and Evaluation Metrics: The evaluation relies mainly on MovieLens and Amazon for recommendation and MS MARCO for retrieval, all datasets with binary relevance. It is missing evaluations on standard IR benchmarks such as TREC-DL, BEIR, or the recent BRIGHT dataset, which are essential for validating general ranking ability. In addition, using MRR as the sole main metric is limiting; nDCG@10 and nDCG@20 are more standard and informative for graded relevance (especially in main table ra
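The metric distinction raised in this review can be made concrete with a small sketch. The definitions below are the standard textbook forms of reciprocal rank and nDCG@k with exponential gain (one common variant), not code from the paper; the two relevance lists are made-up examples, and the ideal ordering is computed from the given list only, which assumes that list contains every judged document.

```python
import math
from typing import Sequence


def reciprocal_rank(relevances: Sequence[float]) -> float:
    """1 / position of the first relevant item, or 0.0 if there is none."""
    for position, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / position
    return 0.0


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top k, using gain 2^rel - 1."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 1)
        for position, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG@k divided by the DCG@k of the ideally sorted list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded labels (0-3) for two rankings of the same documents.
run_a = [3, 0, 1]  # highly relevant document ranked first
run_b = [1, 0, 3]  # marginally relevant document ranked first

rr_a, rr_b = reciprocal_rank(run_a), reciprocal_rank(run_b)
ndcg_a, ndcg_b = ndcg_at_k(run_a, 10), ndcg_at_k(run_b, 10)
```

Both runs place some relevant document first, so their reciprocal ranks are identical, while nDCG@10 scores `run_a` higher because it accounts for the grade of each document.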

Code & Models

Repositories

ulab-uiuc/iranker
PyTorch · Official

Taxonomy

Methods · Balanced Selection