Multi-Task Off-Policy Learning from Bandit Feedback

Joey Hong, Branislav Kveton, Sumeet Katariya, Manzil Zaheer, and Mohammad Ghavamzadeh

arXiv:2212.04720 · cs.LG · December 12, 2022 · 1 citation

TL;DR

This paper introduces HierOPO, a hierarchical off-policy optimization method for multi-task bandit problems. Theoretical bounds and empirical evaluation both indicate that exploiting the structure shared among tasks improves the learned policies.

Contribution

The paper presents HierOPO, a hierarchical off-policy optimization algorithm for learning multiple related tasks from logged bandit feedback, together with theoretical guarantees and an efficient implementation for linear Gaussian models.

Findings

1. HierOPO learns better policies than non-hierarchical methods.
2. Theoretical bounds show lower suboptimality when the hierarchy is used.
3. Empirical results confirm the advantage of exploiting structure shared among tasks.

Abstract

Many practical applications, such as recommender systems and learning to rank, involve solving multiple similar tasks. One example is learning of recommendation policies for users with similar movie preferences, where the users may still rank the individual movies slightly differently. Such tasks can be organized in a hierarchy, where similar tasks are related through a shared structure. In this work, we formulate this problem as a contextual off-policy optimization in a hierarchical graphical model from logged bandit feedback. To solve the problem, we propose a hierarchical off-policy optimization algorithm (HierOPO), which estimates the parameters of the hierarchical model and then acts pessimistically with respect to them. We instantiate HierOPO in linear Gaussian models, for which we also provide an efficient implementation and analysis. We prove per-task bounds on the suboptimality…
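
The two steps named in the abstract (estimate the parameters of the hierarchical model, then act pessimistically) can be illustrated with a minimal sketch. It is not the paper's efficient implementation. It assumes a two-level linear Gaussian model in which a shared parameter mu ~ N(mu_q, Sigma_q), each task parameter is theta_s = mu + eps_s with eps_s ~ N(0, Sigma_0), and rewards are x^T theta_s plus Gaussian noise with standard deviation sigma. The function names, argument names, and the confidence-width multiplier alpha are hypothetical, and the posterior is computed by brute force over the stacked vector [mu, eps_1, ..., eps_m].

```python
import numpy as np

def hier_posterior(X, Y, S, m, mu_q, Sigma_q, Sigma_0, sigma):
    """Gaussian posterior over each task parameter theta_s = mu + eps_s.

    X: (n, d) features of logged actions, Y: (n,) rewards,
    S: (n,) integer task ids in [0, m). All names are illustrative.
    """
    d = len(mu_q)
    D = d * (m + 1)
    # Stacked parameter w = [mu, eps_1, ..., eps_m] with block-diagonal prior.
    prior_mean = np.concatenate([mu_q, np.zeros(d * m)])
    prior_prec = np.zeros((D, D))
    prior_prec[:d, :d] = np.linalg.inv(Sigma_q)
    Sigma_0_inv = np.linalg.inv(Sigma_0)
    for s in range(m):
        lo = d * (s + 1)
        prior_prec[lo:lo + d, lo:lo + d] = Sigma_0_inv
    # Stacked features: x fills the mu block and the block of its own task.
    Phi = np.zeros((len(Y), D))
    Phi[:, :d] = X
    for i, s in enumerate(S):
        lo = d * (s + 1)
        Phi[i, lo:lo + d] = X[i]
    # Standard Bayesian linear regression update on the stacked parameter.
    post_cov = np.linalg.inv(prior_prec + Phi.T @ Phi / sigma**2)
    post_mean = post_cov @ (prior_prec @ prior_mean + Phi.T @ Y / sigma**2)
    # Map back to theta_s = mu + eps_s with a selection matrix A_s.
    means, covs = [], []
    for s in range(m):
        A = np.zeros((d, D))
        A[:, :d] = np.eye(d)
        lo = d * (s + 1)
        A[:, lo:lo + d] = np.eye(d)
        means.append(A @ post_mean)
        covs.append(A @ post_cov @ A.T)
    return np.array(means), np.array(covs)

def pessimistic_action(actions, theta_hat, Sigma_hat, alpha):
    """Index of the action with the highest lower confidence bound."""
    mean = actions @ theta_hat
    var = np.einsum("ad,de,ae->a", actions, Sigma_hat, actions)
    width = np.sqrt(np.maximum(var, 0.0))
    return int(np.argmax(mean - alpha * width))
```

In this sketch, logged data from one task also shrinks the posterior covariance of the other tasks through the shared mu, which is the mechanism by which a hierarchy can help tasks with little data. Setting alpha to 0 recovers greedy selection on the posterior mean.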

Videos

Multi-Task Off-Policy Learning from Bandit Feedback · SlidesLive

Taxonomy

Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Smart Grid Energy Management