RecLLM-R1: A Two-Stage Training Paradigm with Reinforcement Learning and Chain-of-Thought v1
Yu Xie, Xingkai Ren, Ying Qi, Yao Hu, Lianlei Shan

TL;DR
RecLLM-R1 introduces a two-stage training framework for recommendation systems that combines supervised fine-tuning and reinforcement learning with chain-of-thought reasoning, improving accuracy, diversity, and business alignment.
Contribution
The paper presents a novel two-stage training paradigm for LLM-based recommendation systems, integrating reinforcement learning with chain-of-thought to optimize multiple objectives.
Findings
Outperforms baseline methods on real-world social media data
Enhances recommendation diversity and novelty
Mitigates filter bubble effects
Abstract
Traditional recommendation systems often grapple with "filter bubbles", underutilization of external knowledge, and a disconnect between model optimization and business policy iteration. To address these limitations, this paper introduces RecLLM-R1, a novel recommendation framework leveraging Large Language Models (LLMs) and drawing inspiration from the DeepSeek R1 methodology. The framework begins by transforming user profiles, historical interactions, and multi-faceted item attributes into LLM-interpretable natural language prompts through a carefully engineered data construction process. Subsequently, a two-stage training paradigm is employed: the initial stage involves Supervised Fine-Tuning (SFT) to imbue the LLM with fundamental recommendation capabilities. The subsequent stage utilizes Group Relative Policy Optimization (GRPO), a reinforcement learning technique, augmented…
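The abstract names GRPO but is cut off before it describes the reward design, so the following is only a minimal sketch of the group-relative advantage step that gives GRPO its name, not the authors' implementation. The function name, the `eps` constant, and the example reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to its own group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. Each reward is centered on the group mean and scaled by the
    group standard deviation, so no separate value network is needed.
    `eps` (an assumed constant) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four recommendation lists sampled for one user.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses that score above their group's average receive a positive advantage and are reinforced; those below average are discouraged. How RecLLM-R1 computes the rewards themselves is not recoverable from the truncated abstract.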
Taxonomy
Topics: Complex Systems and Decision Making
