Policy Testing in Markov Decision Processes
Kaito Ariu, Po-An Wang, Alexandre Proutiere, Kenshi Abe

TL;DR
This paper studies policy testing in discounted MDPs in the fixed-confidence setting and proposes an algorithm obtained by reformulating the lower-bound problem as policy optimization in a reversed MDP.
Contribution
The paper introduces a new algorithm for policy testing in MDPs, based on a reformulation that handles the non-convex constraints through a reversed-MDP perspective.
Findings
Derived an instance-dependent lower bound for policy testing in MDPs.
Proposed a new algorithm based on reformulating the problem as a policy optimization problem in a reversed MDP.
Showed the approach can be extended to other exploration tasks like policy evaluation.
Abstract
We study the policy testing problem in discounted Markov decision processes (MDPs) in the fixed-confidence setting under a generative model with static sampling. The goal is to decide whether the value of a given policy exceeds a specified threshold while minimizing the number of samples. We first derive an instance-dependent lower bound that any reasonable algorithm must satisfy, characterized as the solution to an optimization problem with non-convex constraints. Guided by this formulation, we propose a new algorithm. While this design paradigm is common in pure exploration problems such as best-arm identification, the non-convex constraints that arise in MDPs introduce substantial difficulties. To address them, we reformulate the lower-bound problem by swapping the roles of the objective and the constraints, yielding an alternative problem with a non-convex objective but convex…
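To make the decision problem concrete, here is a minimal sketch of the quantity being tested, assuming the MDP is fully known. It is not the paper's algorithm, which has to reach the same decision from samples. The sketch evaluates a deterministic policy exactly by solving the Bellman linear system and compares the value at a start state with the threshold. The names `policy_value` and `exceeds_threshold`, the toy MDP, and the choice to measure the value at a single start state `s0` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def policy_value(P, r, policy, gamma):
    """Exact value of a deterministic policy in a known discounted MDP.

    P: (S, A, S) transition probabilities, r: (S, A) rewards,
    policy: (S,) array of action indices, gamma: discount factor in [0, 1).
    """
    S = P.shape[0]
    states = np.arange(S)
    P_pi = P[states, policy]   # (S, S) transition matrix under the policy
    r_pi = r[states, policy]   # (S,) reward vector under the policy
    # Solve the Bellman equation (I - gamma * P_pi) V = r_pi.
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def exceeds_threshold(P, r, policy, gamma, s0, threshold):
    """Ground-truth answer to the test (assumes the value is taken at s0)."""
    return bool(policy_value(P, r, policy, gamma)[s0] > threshold)

# Hypothetical toy MDP: action 0 moves to state 0, action 1 moves to state 1;
# reward 1 is collected only in state 1.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
])
r = np.array([[0.0, 0.0], [1.0, 1.0]])
policy = np.array([1, 1])
gamma = 0.9

V = policy_value(P, r, policy, gamma)
answer = exceeds_threshold(P, r, policy, gamma, s0=0, threshold=8.0)
```

In the fixed-confidence setting described in the abstract, `P` is unknown, so an algorithm must draw transitions from the generative model until it can give this answer with the required confidence.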
