Value Penalized Q-Learning for Recommender Systems
Chengqian Gao, Ke Xu, Kuangqi Zhou, Lanqing Li, Xueqian Wang, Bo Yuan, Peilin Zhao

TL;DR
This paper introduces Value Penalized Q-learning (VPQ), an offline reinforcement learning algorithm designed to improve recommender systems by addressing distributional shift and uncertainty in large action spaces.
Contribution
The paper proposes VPQ, which penalizes unstable Q-values using uncertainty-aware weights, and integrates it with existing recommender system models to enhance offline RL performance.
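To make the idea of penalizing the regression target concrete, the following is a minimal sketch, not the paper's derivation. It assumes an ensemble of Q-heads whose disagreement (standard deviation) serves as the uncertainty signal, and it uses a hypothetical weight `exp(-beta * std)` to shrink uncertain Q-values before the max over actions. The function name `penalized_target`, the parameter `beta`, and the exponential form of the weight are all illustrative assumptions; the penalty actually derived in the paper is not reproduced in this summary. The sketch also assumes non-negative Q-values (for example click or purchase rewards), since a multiplicative weight below 1 only lowers a value that is positive.

```python
import math
from statistics import mean, pstdev


def penalized_target(reward, next_q_ensemble, gamma=0.99, beta=1.0, done=False):
    """Bootstrapped Q-learning target with an uncertainty-weighted next-state value.

    next_q_ensemble: list of K lists (one per hypothetical Q-head), each holding
    one Q-value per candidate action in the next state.
    beta: hypothetical penalty strength; beta = 0 recovers the plain target.
    """
    if done:
        # Terminal transition: no bootstrapped value to penalize.
        return reward

    num_actions = len(next_q_ensemble[0])
    best = -math.inf
    for a in range(num_actions):
        qs = [head[a] for head in next_q_ensemble]
        q_mean = mean(qs)
        q_std = pstdev(qs)  # disagreement across heads as an uncertainty proxy
        # Assumed weight in (0, 1]: high disagreement shrinks the Q-value.
        weight = math.exp(-beta * q_std)
        best = max(best, weight * q_mean)

    return reward + gamma * best


# Two heads, two actions: the heads agree on action 0 and disagree on action 1.
ensemble = [[1.0, 5.0], [1.0, 1.0]]
plain = penalized_target(0.0, ensemble, gamma=0.9, beta=0.0)
penalized = penalized_target(0.0, ensemble, gamma=0.9, beta=1.0)
print(plain, penalized)
```

With `beta=0.0` the target bootstraps from action 1, whose mean is higher but on which the heads disagree. With `beta=1.0` that action is down-weighted and the target falls back to the action the heads agree on. Note that this construction needs no estimate of the behavior policy, which is the property the summary highlights for large item catalogs.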
Findings
VPQ improves recommendation quality in real-world datasets.
The method effectively mitigates distributional shift issues.
VPQ can be used as a plugin with existing recommender models.
Abstract
Scaling reinforcement learning (RL) to recommender systems (RS) is promising since maximizing the expected cumulative rewards for RL agents meets the objective of RS, i.e., improving customers' long-term satisfaction. A key approach to this goal is offline RL, which aims to learn policies from logged data. However, the high-dimensional action space and the non-stationary dynamics in commercial RS intensify distributional shift issues, making it challenging to apply offline RL methods to RS. To alleviate the action distribution shift problem in extracting RL policy from static trajectories, we propose Value Penalized Q-learning (VPQ), an uncertainty-based offline RL algorithm. It penalizes the unstable Q-values in the regression target by uncertainty-aware weights, without the need to estimate the behavior policy, suitable for RS with a large number of items. We derive the penalty…
Taxonomy
Methods: Test · Q-Learning
