Pessimistic Risk-Aware Policy Learning in Contextual Bandits
Yilong Wan, Yuqiang Li, Xianyi Wu

TL;DR
This paper introduces a unified framework for risk-aware offline policy learning in contextual bandits, enabling the optimization of various risk measures with optimal statistical guarantees.
Contribution
It develops a distributional approach for optimizing Lipschitz-continuous risk functionals, providing minimax optimal bounds without restrictive assumptions.
Findings
Achieves an $\tilde{O}(1/\sqrt{n})$ convergence rate for risk functional optimization.
Provides data-dependent suboptimality bounds for importance sampling estimators.
Shows no additional statistical cost compared to risk-neutral policy optimization.
Abstract
We study risk-aware offline policy learning, aiming to learn a decision rule from logged data that is optimal under general risk criteria. This problem is crucial in high-stakes domains where online interaction is infeasible and adverse outcomes must be carefully controlled. However, existing literature on offline contextual bandits either centers on expected-reward criteria or restricts risk considerations to policy evaluation instead of optimization. In this work, we propose a unified distributional framework for optimizing Lipschitz-continuous risk functionals, a broad class of risk measures encompassing mean-variance, entropic risk, and conditional value-at-risk, among others. By developing novel empirical concentration inequalities for importance sampling-based distributional estimators, our analysis derives data-dependent suboptimality bounds with an…
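To make the objects named in the abstract concrete, here is a minimal sketch of an importance-sampling estimate of the reward distribution under a target policy, with one risk functional (lower-tail conditional value-at-risk) evaluated on it. This is an illustration under stated assumptions, not the paper's estimator: the function names, the self-normalized form of the weights, and the lower-tail CVaR convention are all choices made here, and the pessimism penalty that the title refers to is not implemented.

```python
import numpy as np

def is_weighted_cdf(rewards, weights):
    """Sorted rewards and a self-normalized importance-sampling CDF estimate.

    weights[i] is assumed to be pi(a_i | x_i) / mu(a_i | x_i), the ratio of
    target-policy to logging-policy action probabilities (hypothetical inputs).
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(rewards)
    r, w = rewards[order], weights[order]
    # Self-normalization (an assumption of this sketch) keeps the CDF in [0, 1].
    cdf = np.cumsum(w) / np.sum(w)
    return r, cdf

def cvar_lower_tail(rewards, weights, alpha):
    """Mean reward over the worst alpha-fraction of estimated probability mass."""
    r, cdf = is_weighted_cdf(rewards, weights)
    prev = np.concatenate(([0.0], cdf[:-1]))
    # Portion of each atom's mass that lies inside the interval [0, alpha].
    mass = np.clip(np.minimum(cdf, alpha) - prev, 0.0, None)
    return float(np.sum(mass * r) / alpha)

# Toy logged data: four rewards with equal importance weights.
logged_rewards = [4.0, 1.0, 3.0, 2.0]
logged_weights = [1.0, 1.0, 1.0, 1.0]
cvar_half = cvar_lower_tail(logged_rewards, logged_weights, alpha=0.5)
cvar_full = cvar_lower_tail(logged_rewards, logged_weights, alpha=1.0)
```

With equal weights, `cvar_half` averages the two lowest rewards and `cvar_full` reduces to the plain mean, which shows how the risk-neutral criterion appears as a special case of the distributional view.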
