TL;DR
This paper introduces ConvRec-R1, a two-stage framework for training conversational recommender systems with LLMs, improving ranking quality and output accuracy through reinforcement learning and a novel rank-based optimization method.
Contribution
It presents Rank-GRPO, a new reinforcement learning algorithm that optimizes rank-style outputs in LLM-based recommender systems, and a data construction pipeline for better training initialization.
Findings
Faster convergence and higher recall and NDCG on the Reddit-v2 dataset.
Effective handling of rank-style outputs with the proposed Rank-GRPO.
Improved alignment of LLMs to recommendation tasks.
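The recall and NDCG figures mentioned above are standard top-k ranking metrics. As a rough illustration only, the sketch below computes both for a single recommendation list under binary relevance. The function names, the example titles, and the assumption that the ranked list contains no duplicates are illustrative choices, not details taken from the paper.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of the list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k: DCG of the list divided by the ideal DCG."""
    # A hit at 0-based position pos contributes 1 / log2(pos + 2).
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # The ideal list places every relevant item at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical example: four recommended titles, two of them relevant.
ranked = ["Heat", "Alien", "Up", "Jaws"]
relevant = {"Alien", "Jaws"}
print(recall_at_k(ranked, relevant, 2))  # one of two relevant items in top-2
print(ndcg_at_k(ranked, relevant, 4))
```

Unlike recall, NDCG discounts hits by position, so it is sensitive to ranking quality degrading toward the end of the list.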
Abstract
Large language models (LLMs) are reshaping the recommender system paradigm by enabling users to express preferences and receive recommendations through conversations. Yet, aligning LLMs to the recommendation task remains challenging: pretrained LLMs often generate out-of-catalog items, violate required output formats, and their ranking quality degrades sharply toward the end of the generated list. To this end, we propose ConvRec-R1, a two-stage framework for end-to-end training of LLM-based conversational recommender systems. In Stage 1, we construct a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline, which produces high-quality, catalog-grounded demonstrations from powerful blackbox LLMs to warm-start the RL training. In Stage 2, we propose Rank-GRPO, a principled extension of group relative policy optimization (GRPO) tailored to tasks with rank-style outputs. Rank-GRPO…
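For background, standard GRPO (the method Rank-GRPO extends) scores each sampled response relative to the other responses drawn for the same prompt, by normalizing rewards within the group. The sketch below shows only that baseline computation; it is not the rank-level variant the paper proposes, whose details are cut off in the abstract above. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: (reward - group mean) / group std.

    `rewards` holds one scalar reward per response sampled for the same prompt.
    Assumption: population std plus a small eps; implementations differ here.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of four sampled recommendation lists with scalar rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # above-average responses get positive advantages
```

Because each response receives a single scalar advantage, every token in it is credited equally. The peer reviews below describe Rank-GRPO as re-framing the problem at the rank level.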
Peer Reviews
Decision·ICLR 2026 Poster
1. Rank-GRPO is a main contribution. The manuscript clearly identifies the core weaknesses of applying standard GRPO to ranking tasks and proposes a new solution by re-framing the problem at the rank level. The technical solutions are well-motivated and supported by theoretical analysis.
2. The "Remap-Reflect-Adjust" pipeline is a significant engineering contribution that provides a sophisticated solution to the critical data scarcity problem in this domain. It is quite practical for researcher
1. The manuscript utilizes an “LLM as a judge” in its data pipeline but does not adequately discuss or account for known biases of this paradigm, such as position or verbosity bias, which could affect the quality of the SFT dataset.
2. All experiments are conducted on a single dataset in the movie domain. The manuscript would be stronger with a discussion on the potential challenges of applying the framework to other domains like e-commerce or music.
3. The manuscript notes that the model's outp
The paper is well-motivated. The paper is readable.
1. **Performance on larger models (e.g., 7B) is unclear.** Please provide experimental results or discussions on how the proposed method scales to larger backbones (e.g., 7B parameters). This will help verify whether the observed improvements generalize across model sizes.
2. **Baselines in Table 1 are insufficient.** Table 1 should include more **post-training baselines** specific to LLM-based recommender systems, rather than comparing only SFT or SFT + GRPO. Incorporatin
1) The paper addresses an important and emerging problem, i.e., aligning LLMs for conversational recommendation, which has practical utility for the industry and is a relevant problem for the community.
2) The authors provide code, data, and detailed implementation notes, making the work easy to reproduce and build upon.
3) The proposed rank-level GRPO is a generally useful method for ranking tasks. It's well motivated, includes gradient analysis, and performs reasonably well in practice.
4) The ex
1) The core ideas, namely supervised fine-tuning plus RL alignment, primarily extend existing GRPO and RLHF frameworks, without introducing a fundamentally new paradigm.
2) The approach is tightly focused on conversational recommendation and may not generalize well to broader LLM alignment or other ranking tasks, which could limit the paper's impact.
3) The performance improvement over strong prompting baselines (e.g., CRAG) is modest given the added training complexity, and under off-policy set
