DARLR: Dual-Agent Offline Reinforcement Learning for Recommender Systems with Dynamic Reward
Yi Zhang, Ruihong Qiu, Xuwei Xu, Jiajun Liu, Sen Wang

TL;DR
DARLR is a dual-agent offline reinforcement learning framework that uses dynamic reward shaping and adaptive uncertainty penalties to counter inaccurate reward functions in world models, improving the learned recommendation policies.
Contribution
The paper proposes a dual-agent framework, pairing a selector with a recommender, for dynamic reward estimation and uncertainty management in offline RL for recommender systems.
Findings
DARLR outperforms existing methods on four benchmark datasets.
Dynamic reward shaping improves policy accuracy.
Adaptive uncertainty penalties mitigate decision risk better than static ones.
Abstract
Model-based offline reinforcement learning (RL) has emerged as a promising approach for recommender systems, enabling effective policy learning by interacting with frozen world models. However, the reward functions in these world models, trained on sparse offline logs, often suffer from inaccuracies. Specifically, existing methods face two major limitations in addressing this challenge: (1) deterministic use of reward functions as static look-up tables, which propagates inaccuracies during policy learning, and (2) static uncertainty designs that fail to effectively capture decision risks and mitigate the impact of these inaccuracies. In this work, a dual-agent framework, DARLR, is proposed to dynamically update world models to enhance recommendation policies. To achieve this, a selector is introduced to identify reference users by balancing similarity and diversity so…
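The abstract is cut off before it says how the selector scores candidates, so the following is only a minimal sketch of one common way to balance similarity and diversity when picking reference users. It uses a greedy maximal-marginal-relevance rule over cosine similarity, which is a swapped-in technique and not necessarily DARLR's selector. The function names, the `trade_off` parameter, and the idea of representing users as embedding vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_reference_users(target, candidates, k, trade_off=0.5):
    """Greedy MMR-style pick of up to k reference users (hypothetical helper).

    target:     embedding of the target user
    candidates: dict mapping user id -> embedding
    trade_off:  1.0 scores by similarity only; lower values penalise
                candidates that resemble users already selected
    """
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:

        def score(uid):
            similarity = cosine(target, remaining[uid])
            # Redundancy: closest match among the users already chosen.
            redundancy = max(
                (cosine(remaining[uid], candidates[s]) for s in selected),
                default=0.0,
            )
            return trade_off * similarity - (1.0 - trade_off) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


if __name__ == "__main__":
    target_user = [1.0, 0.1]
    pool = {"a": [1.0, 0.0], "b": [0.98, 0.05], "c": [0.5, 0.9]}
    print(select_reference_users(target_user, pool, k=2, trade_off=1.0))
    print(select_reference_users(target_user, pool, k=2, trade_off=0.5))
```

With `trade_off=1.0` the rule reduces to a nearest-neighbour lookup; lowering it makes the second and later picks favour users that differ from those already chosen, which is the similarity/diversity trade-off the abstract describes in words.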
Taxonomy
Topics: Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI) · Advanced Bandit Algorithms Research
Methods: ALIGN
