Semiparametric Off-Policy Inference for Optimal Policy Values under Possible Non-Uniqueness
Haoyu Wei

TL;DR
This paper develops a semiparametric inference method for evaluating optimal policies in Markov decision processes, addressing challenges of non-uniqueness and non-regularity, with theoretical guarantees and practical applications.
Contribution
It introduces NSAVE, a novel semiparametric method for off-policy evaluation of optimal policies that is efficient and doubly robust when the optimal policy is unique and remains stable when it is not.
Findings
NSAVE achieves semiparametric efficiency when the optimal policy is unique.
The method remains stable in degenerate regimes.
The application yields patient-specific confidence intervals.
Abstract
Off-policy evaluation (OPE) constructs confidence intervals for the value of a target policy using data generated under a different behavior policy. Most existing inference methods focus on fixed target policies and may fail when the target policy is estimated as optimal, particularly when the optimal policy is non-unique or nearly deterministic. We study inference for the value of optimal policies in Markov decision processes. We characterize the existence of the efficient influence function and show that non-regularity arises under policy non-uniqueness. Motivated by this analysis, we propose a novel Nonparametric SequentiAl Value Evaluation (NSAVE) method, which achieves semiparametric efficiency and retains the double robustness property when the optimal policy is unique, and remains stable in degenerate regimes beyond the scope of…
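To make the opening definition of OPE concrete, here is a minimal sketch of a confidence interval for the value of a fixed target policy, built from data logged under a different behavior policy. It is not the NSAVE method. It assumes a one-step problem with two actions, a known uniform behavior policy, and Gaussian toy rewards. The names `simulate` and `dr_value_ci` are hypothetical, and the outcome model is fit on the same sample with no cross-fitting.

```python
import math
import random
import statistics


def simulate(n, seed=0):
    """Draw (action, reward) pairs under a uniform behavior policy (assumed toy setup)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        a = rng.randrange(2)          # behavior policy: each action with probability 0.5
        r = rng.gauss(float(a), 1.0)  # toy rewards: mean 0 for action 0, mean 1 for action 1
        data.append((a, r))
    return data


def dr_value_ci(data, target_action, behavior_prob=0.5, z=1.96):
    """Doubly robust point estimate and Wald interval for a fixed deterministic target policy."""
    # Outcome model: sample mean reward for each action (fit in-sample, an assumption).
    q_hat = {}
    for a in (0, 1):
        rewards = [r for (ai, r) in data if ai == a]
        q_hat[a] = statistics.fmean(rewards) if rewards else 0.0

    # Per-sample score: model prediction plus an importance-weighted residual correction.
    scores = []
    for a, r in data:
        weight = 1.0 / behavior_prob if a == target_action else 0.0
        scores.append(q_hat[target_action] + weight * (r - q_hat[a]))

    estimate = statistics.fmean(scores)
    std_err = statistics.stdev(scores) / math.sqrt(len(scores))
    return estimate, (estimate - z * std_err, estimate + z * std_err)


data = simulate(5000, seed=0)
estimate, (lower, upper) = dr_value_ci(data, target_action=1)
print(f"estimate={estimate:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

In this toy setup the target policy always takes action 1, so its true value is 1. The interval relies on the target policy being fixed in advance. The abstract's point is that this kind of argument can break down when the target policy is itself estimated as optimal and the optimum is not unique.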
Taxonomy
Topics: Supply Chain and Inventory Management
