User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation
Xingyuan Xiang, Xiangchen Pan, Wei Wei

TL;DR
This paper introduces SMTPO, a framework that uses user simulators and reinforcement learning to improve multi-turn conversational recommendations by better aligning simulated feedback with true user preferences.
Contribution
The paper proposes a multi-turn preference optimization method that improves the quality of simulated feedback and aligns recommendations with true user preferences, even though the simulator cannot access true preference labels at inference time.
Findings
SMTPO improves recommendation accuracy in multi-turn conversations.
Enhanced feedback quality via multi-task fine-tuning leads to better preference modeling.
Reinforcement learning with reward design effectively aligns simulated feedback with true preferences.
Abstract
Conversational Recommender Systems (CRSs) leverage natural language interactions for personalized recommendation, yet information-scarce dialogue histories and single-turn recommendation paradigms may severely hinder accurate modeling of complex user preferences. To alleviate this issue, recent studies have introduced LLM-based user simulators, which generate natural language feedback and perform simulated multi-turn interactions to assist recommendation. Nevertheless, since simulators cannot access true user preference labels during inference, their feedback may deviate from actual user interests, causing errors to accumulate over multiple interactions and severely affecting the generalization of the recommender. Inspired by the multi-step reasoning capabilities of LLMs and the effectiveness of reinforcement learning in policy optimization, we propose SMTPO, a user simulator-guided…
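To make the simulated multi-turn setting concrete, here is a minimal sketch of a recommend-then-feedback loop in which the simulator never sees the target item. It is not SMTPO itself: the abstract does not give the training procedure or the reward, so every name below (`simulate_dialogue`, `first_hit_turn`, `toy_recommend`, `toy_feedback`, `CATALOG`) is a hypothetical stand-in, and the rule-based recommender and simulator merely take the place of the LLM components.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces: both callables stand in for LLM components.
Recommender = Callable[[List[str]], str]
Simulator = Callable[[List[str], str], str]


def simulate_dialogue(
    recommend: Recommender,
    give_feedback: Simulator,
    history: List[str],
    max_turns: int = 3,
) -> List[Tuple[str, str]]:
    """Run max_turns rounds of recommend -> simulated feedback."""
    trace = []
    context = list(history)  # copy so the caller's history is not mutated
    for _ in range(max_turns):
        item = recommend(context)
        # The simulator only sees the dialogue so far, never the true target.
        feedback = give_feedback(context, item)
        trace.append((item, feedback))
        context += [f"system: {item}", f"user: {feedback}"]
    return trace


def first_hit_turn(trace: List[Tuple[str, str]], target: str) -> Optional[int]:
    """Return the 1-based turn at which the target was recommended, if any."""
    for turn, (item, _) in enumerate(trace, start=1):
        if item == target:
            return turn
    return None


CATALOG = ["Alien", "Heat", "Up"]  # toy item set, assumed for illustration


def toy_recommend(context: List[str]) -> str:
    """Pick the first catalog item that has not been recommended yet."""
    prefix = "system: "
    seen = {line[len(prefix):] for line in context if line.startswith(prefix)}
    for item in CATALOG:
        if item not in seen:
            return item
    return CATALOG[-1]


def toy_feedback(context: List[str], item: str) -> str:
    """Uninformative feedback, as a label-free simulator might produce."""
    return f"Not {item}, something else please."


trace = simulate_dialogue(toy_recommend, toy_feedback, ["user: I want a movie."])
print([item for item, _ in trace])
print(first_hit_turn(trace, "Up"))
```

In this toy run the feedback carries no preference signal, so the recommender can only eliminate items one by one. That is a simplified picture of why feedback that drifts from the user's real interests lets errors build up across turns.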
