An Extremely Data-efficient and Generative LLM-based Reinforcement Learning Agent for Recommenders
Shuang Feng, Grace Feng

TL;DR
This paper presents a highly data-efficient RL agent for recommender systems built on LLMs. By fine-tuning pre-trained models with preference-based training methods, it achieves competitive performance with minimal training data and time.
Contribution
It introduces a low-cost, data-efficient RL training approach for recommender systems using generative trajectories and preference optimization techniques with LLMs.
Findings
Generated trajectories match human data in task performance.
The DPO agent achieved a 19% success rate with under 30 minutes of training.
Limited training time sufficed for competitive results.
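For readers unfamiliar with the preference-based training behind the DPO agent, the following is a minimal sketch of the standard Direct Preference Optimization loss for a single preference pair. It is not the authors' implementation: the function name, the `beta` value, and the log-probabilities in the usage lines are illustrative assumptions, and a real agent would compute these sequence log-probabilities from the fine-tuned and reference LLMs over whole trajectories.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response
    (here: a trajectory) under the trainable policy or the frozen
    reference model. beta=0.1 is an assumed value, not taken from the paper.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Hypothetical log-probabilities, for illustration only.
neutral = dpo_loss(-12.0, -12.0, -12.0, -12.0)   # policy equals reference
improved = dpo_loss(-10.0, -15.0, -12.0, -12.0)  # policy favors the chosen trajectory
print(neutral, improved)
```

The loss falls as the policy assigns relatively more probability to the preferred trajectory than the reference model does, so no separate reward model is needed during training.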
Abstract
Recent advancements in large language models (LLMs) have enabled understanding webpage contexts, product details, and human instructions. Utilizing LLMs as the foundational architecture for either reward models or policies in reinforcement learning has gained popularity -- a notable achievement is the success of InstructGPT. RL algorithms have been instrumental in maximizing long-term customer satisfaction and avoiding short-term, myopic goals in industrial recommender systems, which often rely on deep learning models to predict immediate clicks or purchases. In this project, several RL methods are implemented and evaluated using the WebShop benchmark environment, data, simulator, and pre-trained model checkpoints. The goal is to train an RL agent to maximize the purchase reward given a detailed human instruction describing a desired product. The RL agents are developed by fine-tuning…
Taxonomy
Topics: Recommender Systems and Techniques
Methods: Direct Preference Optimization · Softmax · Linear Layer · Dropout · Adam · Layer Normalization · Weight Decay · Attention Is All You Need · Dense Connections
