Bayesian Off-Policy Evaluation and Learning for Large Action Spaces
Imad Aouali, Victor-Emmanuel Brunel, David Rohde, Anna Korba

TL;DR
This paper introduces a Bayesian framework for off-policy evaluation and learning in large action spaces, effectively capturing action correlations to improve sample efficiency and performance.
Contribution
It proposes a unified Bayesian approach, sDM, that leverages action correlations without sacrificing computational efficiency, and introduces Bayesian metrics for algorithm assessment.
Findings
sDM shows strong empirical performance.
Leveraging action correlations benefits both off-policy evaluation and off-policy learning.
Bayesian metrics assess an algorithm's average performance across problem instances rather than its worst case.
Abstract
In interactive systems, actions are often correlated, presenting an opportunity for more sample-efficient off-policy evaluation (OPE) and learning (OPL) in large action spaces. We introduce a unified Bayesian framework to capture these correlations through structured and informative priors. In this framework, we propose sDM, a generic Bayesian approach for OPE and OPL, grounded in both algorithmic and theoretical foundations. Notably, sDM leverages action correlations without compromising computational efficiency. Moreover, inspired by online Bayesian bandits, we introduce Bayesian metrics that assess the average performance of algorithms across multiple problem instances, deviating from the conventional worst-case assessments. We analyze sDM in OPE and OPL, highlighting the benefits of leveraging action correlations. Empirical evidence showcases the strong performance of sDM.
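To make the idea of a structured prior concrete, here is a minimal sketch of a Bayesian direct-method estimate in a toy, context-free setting. It is not the paper's sDM. The model, the function names (`posterior_action_means`, `dm_value`), and all parameters (`b`, `sigma0`, `sigma_psi`, `sigma`) are assumptions made for illustration. Each action's mean reward is tied to one shared latent scalar, so logged rewards for some actions also inform the estimates for actions that were never logged. The sketch inverts a dense K x K matrix for readability and makes no attempt at the computational efficiency the abstract mentions.

```python
import numpy as np


def posterior_action_means(actions, rewards, b, sigma0=0.5, sigma_psi=1.0, sigma=1.0):
    """Posterior mean and covariance of the per-action mean rewards theta.

    Assumed toy model (illustrative, not taken from the paper):
      psi ~ N(0, sigma_psi^2)                  shared latent scalar
      theta_a | psi ~ N(b[a] * psi, sigma0^2)  mean reward of action a
      r | a ~ N(theta_a, sigma^2)              logged reward
    Marginalising psi gives theta ~ N(0, sigma0^2 I + sigma_psi^2 b b^T),
    so actions are correlated a priori whenever sigma_psi > 0.
    """
    b = np.asarray(b, dtype=float)
    num_actions = b.shape[0]
    prior_cov = sigma0**2 * np.eye(num_actions) + sigma_psi**2 * np.outer(b, b)

    # Sufficient statistics of the logged data: pull counts and reward sums.
    counts = np.bincount(actions, minlength=num_actions).astype(float)
    reward_sums = np.bincount(actions, weights=rewards, minlength=num_actions)

    # Gaussian conjugacy with a zero prior mean:
    # posterior precision = prior precision + diag(counts) / sigma^2.
    post_precision = np.linalg.inv(prior_cov) + np.diag(counts) / sigma**2
    post_cov = np.linalg.inv(post_precision)
    post_mean = post_cov @ (reward_sums / sigma**2)
    return post_mean, post_cov


def dm_value(policy, post_mean):
    """Direct-method value estimate: expected posterior-mean reward under `policy`."""
    return float(np.dot(policy, post_mean))


# Hypothetical logged data: only actions 0 and 1 were ever taken.
logged_actions = np.array([0, 0, 1, 1, 0])
logged_rewards = np.ones(5)
loadings = np.ones(4)  # assumed loadings of each action on the shared latent

correlated_mean, _ = posterior_action_means(logged_actions, logged_rewards, loadings)
independent_mean, _ = posterior_action_means(
    logged_actions, logged_rewards, loadings, sigma_psi=0.0
)
uniform_policy = np.full(4, 0.25)
value_estimate = dm_value(uniform_policy, correlated_mean)
```

Under the correlated prior, the posterior means of the unlogged actions 2 and 3 move away from zero in the direction of the observed rewards. With `sigma_psi=0.0` the prior is independent across actions, and those unlogged actions keep their prior mean of zero.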
Taxonomy
Topics: Machine Learning and Algorithms · Reinforcement Learning in Robotics · Context-Aware Activity Recognition Systems
