Value Function Decomposition in Markov Recommendation Process
Xiaobei Wang, Shuchang Liu, Qingpeng Cai, Xiang Li, Lantao Hu, Han Li, Guangming Xie

TL;DR
This paper proposes a value function decomposition method for reinforcement learning in recommender systems, improving long-term reward estimation, learning speed, and robustness by disentangling stochastic policy and environment factors.
Contribution
It introduces a novel disentangled learning framework that separately models stochastic policy and user environment factors in value function estimation for better recommendation performance.
Findings
Faster convergence in value estimation
Enhanced robustness against action exploration
Improved long-term reward prediction accuracy
Abstract
Recent advances in recommender systems have shown that user-system interaction essentially formulates long-term optimization problems, and online reinforcement learning can be adopted to improve recommendation performance. The general solution framework incorporates a value function that estimates the user's expected cumulative rewards in the future and guides the training of the recommendation policy. To avoid local maxima, the policy may explore potential high-quality actions during inference to increase the chance of finding better future rewards. To accommodate the stepwise recommendation process, one widely adopted approach to learning the value function is learning from the difference between the values of two consecutive states of a user. However, we argue that this paradigm involves a challenge of Mixing Random Factors: there exist two random factors from the stochastic policy…
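The "difference between the values of two consecutive states" that the abstract describes is the standard temporal-difference (TD) signal. The sketch below shows a plain tabular TD(0) update, not the decomposition the paper proposes. The function name `td0_update`, the dictionary-based value table, and the `alpha` and `gamma` defaults are illustrative assumptions and do not come from the paper.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular TD(0) update in place and return the TD error.

    values:     dict mapping a state key to its estimated value (hypothetical layout)
    state:      the user's state before the recommendation
    reward:     immediate feedback observed after the recommendation
    next_state: the user's state after the recommendation
    alpha:      learning rate (assumed value)
    gamma:      discount factor (assumed value)
    """
    current = values.get(state, 0.0)
    # Bootstrapped target: immediate reward plus discounted value of the next state.
    td_target = reward + gamma * values.get(next_state, 0.0)
    # The learning signal is the gap between the target and the current estimate.
    td_error = td_target - current
    values[state] = current + alpha * td_error
    return td_error


if __name__ == "__main__":
    values = {}
    # One observed transition. In a recommender, (reward, next_state) depends both on
    # the action sampled by the policy and on how the user responds to it.
    error = td0_update(values, "s0", reward=1.0, next_state="s1", alpha=0.5)
    print(error, values)
```

Each sampled transition in this update reflects both the action the stochastic policy happened to pick and the user's response to it, so a single TD error carries both sources of randomness at once. This is the setting in which the abstract raises the Mixing Random Factors challenge.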
