Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning

Yaochen Zhu; Harald Steck; Dawen Liang; Yinhan He; Vito Ostuni; Jundong Li; Nathan Kallus

arXiv:2510.20150·cs.IR·February 17, 2026

3 Reviews

TL;DR

This paper introduces ConvRec-R1, a two-stage framework for training conversational recommender systems with LLMs, improving ranking quality and output accuracy through reinforcement learning and a novel rank-based optimization method.

Contribution

It presents Rank-GRPO, a new reinforcement learning algorithm that optimizes rank-style outputs in LLM-based recommender systems, and a data construction pipeline for better training initialization.

Findings

01

Faster convergence and higher Recall and NDCG on the Reddit-v2 dataset.

02

Effective handling of rank-style outputs with the proposed Rank-GRPO.

03

Improved alignment of LLMs to recommendation tasks.
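The Recall and NDCG figures mentioned in the findings are standard top-k ranking metrics. As a minimal sketch, assuming binary relevance (an item is either in the ground-truth set or not) and a recommendation list without duplicates, they can be computed as below. The function names and the toy data are illustrative and are not taken from the paper's code.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of `ranked`."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def dcg_at_k(ranked, relevant, k):
    """Discounted cumulative gain with binary gains and a log2 position discount."""
    return sum(
        1.0 / math.log2(pos + 2)  # pos is 0-based, so rank 1 gets 1 / log2(2) = 1
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )


def ndcg_at_k(ranked, relevant, k):
    """DCG normalized by the DCG of an ideal ranking (all relevant items first)."""
    ideal_hits = min(len(relevant), k)
    if ideal_hits == 0:
        return 0.0
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg_at_k(ranked, relevant, k) / idcg


# Toy example (assumed data): a ranked list of movie titles and a ground-truth set.
ranked = ["A", "B", "C", "D"]
relevant = {"B", "D", "E"}
print(recall_at_k(ranked, relevant, 4))
print(ndcg_at_k(ranked, relevant, 4))
```

Because of the logarithmic discount, a hit near the top of the list contributes more to NDCG than the same hit near the end, which is why degradation toward the tail of a generated list shows up in this metric.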

Abstract

Large language models (LLMs) are reshaping the recommender system paradigm by enabling users to express preferences and receive recommendations through conversations. Yet, aligning LLMs to the recommendation task remains challenging: pretrained LLMs often generate out-of-catalog items, violate required output formats, and their ranking quality degrades sharply toward the end of the generated list. To this end, we propose ConvRec-R1, a two-stage framework for end-to-end training of LLM-based conversational recommender systems. In Stage 1, we construct a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline, which produces high-quality, catalog-grounded demonstrations from powerful blackbox LLMs to warm-start the RL training. In Stage 2, we propose Rank-GRPO, a principled extension of group relative policy optimization (GRPO) tailored to tasks with rank-style outputs. Rank-GRPO…
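The abstract is truncated before it describes how Rank-GRPO works, so the sketch below covers only what is certain: standard GRPO scores each sampled response relative to the other responses drawn for the same prompt, by normalizing rewards within the group. The second function is a hypothetical illustration of what moving that normalization to the rank level could look like; it is an assumption made for illustration, not the paper's formulation. Both function names are invented here, and the choice of population standard deviation is also an assumption (implementations differ).

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: (reward - group mean) / group std.

    `rewards` holds one scalar reward per response sampled for the same prompt.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


def rank_level_advantages(rank_rewards, eps=1e-8):
    """Hypothetical rank-level variant (NOT the paper's exact method).

    `rank_rewards[i][k]` is an assumed per-rank reward for position k of the
    i-th sampled list. Each rank position is normalized across the group,
    so every rank gets its own advantage instead of sharing one per list.
    """
    n_lists = len(rank_rewards)
    n_ranks = len(rank_rewards[0])
    per_rank = [
        group_relative_advantages([rank_rewards[i][k] for i in range(n_lists)], eps)
        for k in range(n_ranks)
    ]
    # Transpose back to [list][rank] layout.
    return [[per_rank[k][i] for k in range(n_ranks)] for i in range(n_lists)]


# Toy example (assumed rewards): four sampled lists for one prompt.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
print(rank_level_advantages([[1.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]))
```

With sequence-level advantages, every item in a list receives the same credit regardless of whether that particular item was a good recommendation; assigning credit per rank is one way to address this, which matches the reviewers' description of the method as re-framing the problem at the rank level.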

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 2·Confidence 4

Strengths

1. Rank-GRPO is a main contribution. The manuscript clearly identifies the core weaknesses of applying standard GRPO to ranking tasks and proposes a new solution by re-framing the problem at the rank level. The technical solutions are well-motivated and supported by theoretical analysis. 2. The "Remap-Reflect-Adjust" pipeline is a significant engineering contribution that provides a sophisticated solution to the critical data scarcity problem in this domain. It is quite practical for researcher

Weaknesses

1. The manuscript utilizes an “LLM as a judge” in its data pipeline but does not adequately discuss or account for known biases of this paradigm, such as position or verbosity bias, which could affect the quality of the SFT dataset. 2. All experiments are conducted on a single dataset in the movie domain. The manuscript would be stronger with a discussion on the potential challenges of applying the framework to other domains like e-commerce or music. 3. The manuscript notes that the model's outp

Reviewer 02·Rating 4·Confidence 3

Strengths

The paper is well-motivated. The paper is readable.

Weaknesses

1. **Performance on larger models (e.g., 7B) is unclear.** Please provide experimental results or discussions on how the proposed method scales to larger backbones (e.g., 7B parameters). This will help verify whether the observed improvements generalize across model sizes. 2. **Baselines in Table 1 are insufficient.** Table 1 should include more **post-training baselines** specific to LLM-based recommender systems, rather than comparing only SFT or SFT + GRPO. Incorporatin

Reviewer 03·Rating 4·Confidence 3

Strengths

1) The paper addresses an important and emerging problem, i.e., aligning LLMs for conversational recommendation, which has practical utility for the industry and is a relevant problem for the community. 2) The authors provide code, data, and detailed implementation notes, making the work easy to reproduce and build upon. 3) The proposed rank-level GRPO is a generally useful method for ranking tasks. It's well motivated, includes gradient analysis, and performs reasonably well in practice. 4) The ex

Weaknesses

1) The core ideas, namely supervised fine-tuning plus RL alignment, primarily extend existing GRPO and RLHF frameworks, without introducing a fundamentally new paradigm. 2) The approach is tightly focused on conversational recommendation and may not generalize well to broader LLM alignment or other ranking tasks, which could limit the paper's impact. 3) The performance improvement over strong prompting baselines (e.g., CRAG) is modest given the added training complexity, and under off-policy set
