Wasserstein Distributionally Robust Policy Evaluation and Learning for Contextual Bandits
Yi Shen, Pan Xu, Michael M. Zavlanos

TL;DR
This paper introduces a Wasserstein-based distributionally robust method for policy evaluation and learning in contextual bandits, which handles mismatch between the data-collection and deployment environments more effectively than traditional KL-based approaches.
Contribution
It proposes a novel Wasserstein DRO framework with efficient optimization techniques and theoretical guarantees, improving robustness in off-policy evaluation and learning.
Findings
Wasserstein DRO outperforms KL-based methods in environment mismatch scenarios.
The proposed method achieves competitive policy evaluation accuracy.
Theoretical analysis establishes finite-sample and iteration-complexity bounds.
Abstract
Off-policy evaluation and learning are concerned with assessing a given policy and learning an optimal policy from offline data without direct interaction with the environment. Often, the environment in which the data are collected differs from the environment in which the learned policy is applied. To account for the effect of different environments during learning and execution, distributionally robust optimization (DRO) methods have been developed that compute worst-case bounds on the policy values assuming that the distribution of the new environment lies within an uncertainty set. Typically, this uncertainty set is defined based on the KL divergence around the empirical distribution computed from the logging dataset. However, the KL uncertainty set fails to encompass distributions with varying support and lacks awareness of the geometry of the distribution support. As a result, KL…
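The support limitation mentioned in the abstract can be illustrated with a small sketch. This is not the paper's method; it is a toy comparison, assuming two hypothetical discrete reward distributions (`logged` and `shifted`) whose supports differ by a small offset. The helper names `kl_divergence` and `wasserstein_1d` are introduced here for illustration only.

```python
import math

def kl_divergence(q, p):
    """KL(q || p) for discrete distributions given as {support_point: probability} dicts."""
    total = 0.0
    for x, qx in q.items():
        if qx == 0.0:
            continue
        px = p.get(x, 0.0)
        if px == 0.0:
            # q puts mass where p has none, so the divergence is infinite.
            return math.inf
        total += qx * math.log(qx / px)
    return total

def wasserstein_1d(p, q):
    """1-Wasserstein distance between 1-D discrete distributions,
    computed as the integral of |CDF_p - CDF_q| over the real line."""
    points = sorted(set(p) | set(q))
    cdf_p = cdf_q = 0.0
    dist = 0.0
    for left, right in zip(points, points[1:]):
        cdf_p += p.get(left, 0.0)
        cdf_q += q.get(left, 0.0)
        dist += abs(cdf_p - cdf_q) * (right - left)
    return dist

# Hypothetical example data: rewards observed in the logging environment,
# and the same rewards shifted by 0.1 in the deployment environment.
logged = {0.0: 0.5, 1.0: 0.5}
shifted = {0.1: 0.5, 1.1: 0.5}

print(kl_divergence(shifted, logged))   # infinite: supports do not overlap
print(wasserstein_1d(logged, shifted))  # small: mass moves only a short distance
```

Under these assumptions, no finite-radius KL ball around `logged` contains `shifted`, while a Wasserstein ball of modest radius does, because the Wasserstein distance accounts for how far the mass has to move.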
Taxonomy
Topics: Risk and Portfolio Optimization · Domain Adaptation and Few-Shot Learning
