Offline Policy Optimization with Eligible Actions
Yao Liu, Yannis Flet-Berliac, Emma Brunskill

TL;DR
This paper addresses overfitting in offline policy optimization by introducing a normalization constraint, demonstrating improved performance and reduced overfitting in healthcare and control tasks.
Contribution
It proposes a novel per-state-neighborhood normalization algorithm to mitigate overfitting in importance-weighted offline policy optimization, with theoretical and empirical validation.
Findings
Reduced overfitting in policy learning
Improved test performance over existing methods
Effective in healthcare and control environments
Abstract
Offline policy optimization could have a large impact on many real-world decision-making problems, as online learning may be infeasible in many applications. Importance sampling and its variants are a commonly used type of estimator in offline policy evaluation, and such estimators typically do not require assumptions on the properties and representational capabilities of value function or decision process model function classes. In this paper, we identify an important overfitting phenomenon in optimizing the importance weighted return, in which it may be possible for the learned policy to essentially avoid making aligned decisions for part of the initial state space. We propose an algorithm to avoid this overfitting through a new per-state-neighborhood normalization constraint, and provide a theoretical justification of the proposed algorithm. We also show the limitations of previous…
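The overfitting effect described in the abstract can be illustrated with a toy example. The sketch below is not the paper's algorithm or its normalization constraint. It assumes a hypothetical one-step (bandit-style) dataset with six distinct initial states, two actions, a uniform behavior policy, and made-up rewards. It compares a policy that follows every logged action with one that places zero probability on the logged action wherever the logged reward was negative. The function names `is_estimate` and `snis_estimate` are illustrative.

```python
import numpy as np


def is_estimate(pi_probs, mu_probs, rewards):
    """Ordinary importance-sampling estimate of a policy's value.

    pi_probs[i] is the evaluated policy's probability of the logged action
    in sample i; mu_probs[i] is the behavior policy's probability of it.
    """
    weights = pi_probs / mu_probs
    return float(np.mean(weights * rewards))


def snis_estimate(pi_probs, mu_probs, rewards):
    """Self-normalized (weighted) importance-sampling estimate."""
    weights = pi_probs / mu_probs
    return float(np.sum(weights * rewards) / np.sum(weights))


# Hypothetical logged data (assumption, not from the paper): one logged
# action and reward per initial state, uniform behavior over two actions.
mu_probs = np.full(6, 0.5)
rewards = np.array([-1.0, -0.5, -0.2, 0.3, 0.8, 1.0])

# Policy A: takes the logged action in every state.
pi_follow = np.ones(6)

# Policy B: moves all probability to the unobserved action in states whose
# logged reward was negative, so those samples get zero weight.
pi_avoid = (rewards > 0).astype(float)

# Fraction of initial states in which each policy ignores the logged action.
ignored_follow = float(np.mean(pi_follow == 0.0))
ignored_avoid = float(np.mean(pi_avoid == 0.0))

print("follow:", is_estimate(pi_follow, mu_probs, rewards),
      snis_estimate(pi_follow, mu_probs, rewards), ignored_follow)
print("avoid: ", is_estimate(pi_avoid, mu_probs, rewards),
      snis_estimate(pi_avoid, mu_probs, rewards), ignored_avoid)
```

Under both estimators the second policy scores higher, yet the data say nothing about how it behaves in the half of the initial states where it ignores the logged action. A constraint that limits how much weight a policy can withdraw from observed actions in a state's neighborhood targets this kind of inflated estimate.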
Taxonomy
Topics: Advanced Bandit Algorithms Research · Data Stream Mining Techniques · Machine Learning and Data Classification
Methods: Test
