Reinforcement Learning for Slate-based Recommender Systems: A Tractable Decomposition and Practical Methodology

Eugene Ie; Vihan Jain; Jing Wang; Sanmit Narvekar; Ritesh Agarwal; Rui Wu; Heng-Tze Cheng; Morgane Lustman; Vince Gatto; Paul Covington; Jim McFadden; Tushar Chandra; Craig Boutilier

arXiv:1905.12767 · cs.LG · June 3, 2019 · 24 cites

Open Access · 3 Repos

TL;DR

This paper introduces SLATEQ, an RL-based approach to slate recommendation that decomposes the long-term value of a slate so that recommendations can be optimized for long-term user engagement at scale. The approach is validated in simulation and in live experiments on YouTube.

Contribution

The paper presents a new decomposition method for RL in slate recommendations, enabling tractable long-term value optimization and practical implementation.

Findings

01. SLATEQ decomposes the long-term value of a slate under mild assumptions on user choice behavior, making RL tractable over the combinatorial slate action space.

02. The methodology builds on existing recommenders to optimize for long-term engagement.

03. Live experiments on YouTube validate the approach's scalability and effectiveness.

Abstract

Most practical recommender systems focus on estimating immediate user engagement without considering the long-term effects of recommendations on user behavior. Reinforcement learning (RL) methods offer the potential to optimize recommendations for long-term user engagement. However, since users are often presented with slates of multiple items - which may have interacting effects on user choice - methods are required to deal with the combinatorics of the RL action space. In this work, we address the challenge of making slate-based recommendations to optimize long-term value using RL. Our contributions are three-fold. (i) We develop SLATEQ, a decomposition of value-based temporal-difference and Q-learning that renders RL tractable with slates. Under mild assumptions on user choice behavior, we show that the long-term value (LTV) of a slate can be decomposed into a tractable function of…
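The decomposition idea in the abstract can be illustrated with a small sketch. The abstract text above is truncated, so the exact form used here is an assumption: the value of a slate is taken to be the choice-probability-weighted sum of item-wise long-term values, and user choice is assumed to follow a simple proportional (conditional-choice) model with a "no click" option. The function names (`choice_probs`, `slate_q`, `best_slate`) and the numbers are hypothetical and are not taken from the paper's code.

```python
from itertools import combinations


def choice_probs(scores, null_score=1.0):
    """Assumed choice model: P(i | slate) is proportional to the item's score.

    `scores` are non-negative user-item affinities for the items on the slate;
    `null_score` is the weight given to the user selecting nothing.
    """
    total = null_score + sum(scores)
    return [v / total for v in scores]


def slate_q(item_q, scores, null_score=1.0):
    """Assumed decomposition: Q(slate) = sum_i P(i | slate) * Q_item(i)."""
    probs = choice_probs(scores, null_score)
    return sum(p * q for p, q in zip(probs, item_q))


def best_slate(item_q, scores, k, null_score=1.0):
    """Brute-force search over all k-item slates (feasible only for tiny catalogs)."""
    def value(slate):
        return slate_q(
            [item_q[i] for i in slate],
            [scores[i] for i in slate],
            null_score,
        )
    return max(combinations(range(len(item_q)), k), key=value)


# Hypothetical item-wise long-term values and immediate appeal scores.
item_q = [1.0, 4.0, 2.0]
scores = [3.0, 1.0, 1.0]
print(best_slate(item_q, scores, k=2))  # (1, 2)
```

In this toy example the most immediately appealing item (index 0) is left off the best slate because its item-wise long-term value is low. The brute-force search is used only to keep the sketch short; it enumerates every slate and would not scale to a realistic catalog.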

Taxonomy

Topics: Recommender Systems and Techniques · Advanced Bandit Algorithms Research · Smart Grid Energy Management

Methods: Q-Learning