Best Policy Identification in Linear MDPs
Jerome Taupin, Yassir Jedra, Alexandre Proutiere

TL;DR
This paper studies best policy identification in discounted linear Markov Decision Processes under a generative model. It derives an instance-specific lower bound on the number of samples required, proposes near-optimal sampling algorithms, and extends them to episodic settings.
Contribution
It derives an instance-specific lower bound and proposes near-optimal algorithms with proven sample complexity bounds for linear MDPs.
Findings
Sample complexity upper bound of $\frac{d}{(\text{gap})^2}$ times logarithmic factors.
Algorithm matches existing lower bounds in the moderate-confidence regime.
Extension of algorithms to episodic linear MDPs.
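
To make the scaling in the first finding concrete, here is a minimal sketch that evaluates only the leading $d / \text{gap}^2$ term. The function name `leading_term` is hypothetical, and the constants and logarithmic factors of the actual bound are deliberately left out, so the numbers illustrate relative growth rather than real sample counts.

```python
def leading_term(d: int, gap: float) -> float:
    """Leading d / gap^2 term; constants and logarithmic factors are omitted."""
    if gap <= 0:
        raise ValueError("gap must be positive")
    return d / gap ** 2


# Halving the gap multiplies the leading term by four.
print(leading_term(10, 0.5))   # 40.0
print(leading_term(10, 0.25))  # 160.0
```

The term grows linearly in the feature dimension and quadratically in the inverse gap, so instances with a small gap dominate the cost.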
Abstract
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model. We first derive an instance-specific lower bound on the expected number of samples required to identify an $\varepsilon$-optimal policy with probability $1-\delta$. The lower bound characterizes the optimal sampling rule as the solution of an intricate non-convex optimization program, but can be used as the starting point to devise simple and near-optimal sampling rules and algorithms. We devise such algorithms. One of these exhibits a sample complexity upper bounded by $\mathcal{O}\left(\frac{d}{(\varepsilon+\Delta)^2}\left(\log\frac{1}{\delta}+d\right)\right)$ where $\Delta$ denotes the minimum reward gap of sub-optimal actions and $d$ is the dimension of the feature space. This upper bound holds in the moderate-confidence regime (i.e., for all…
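
As an illustration of the gap quantity that drives the bound, the sketch below runs value iteration on a tiny discounted MDP and reports the smallest gap between the best action and any sub-optimal action. Several things here are assumptions rather than the paper's construction: the two-state example, the helper names `q_values` and `min_gap`, and the reading of the gap as $V^*(s) - Q^*(s,a)$. A tabular MDP is used because it is a special case of a linear MDP with one-hot features.

```python
def q_values(P, R, gamma, iters=200):
    """Value iteration; returns Q[s][a] for a small discounted MDP.

    P[s][a] is a list of next-state probabilities, R[s][a] is a reward.
    """
    n_states = len(R)
    V = [0.0] * n_states
    Q = [[0.0] * len(R[s]) for s in range(n_states)]
    for _ in range(iters):
        Q = [
            [
                R[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a]))
                for a in range(len(R[s]))
            ]
            for s in range(n_states)
        ]
        V = [max(row) for row in Q]
    return Q


def min_gap(Q):
    """Smallest difference between the best and any sub-optimal action value."""
    gaps = []
    for row in Q:
        best = max(row)
        gaps.extend(best - q for q in row if q < best)
    return min(gaps)


# Hypothetical 2-state, 2-action example (deterministic transitions).
P = [[[1.0, 0.0], [0.0, 1.0]],   # state 0: action 0 stays, action 1 moves to state 1
     [[1.0, 0.0], [0.0, 1.0]]]   # state 1: action 0 moves to state 0, action 1 stays
R = [[1.0, 0.0],
     [0.0, 0.2]]

Q = q_values(P, R, gamma=0.5)
print(round(min_gap(Q), 6))
```

With exact knowledge of the model the gap can be computed directly as above; the identification problem in the abstract is harder because the model is only accessible through samples.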
Taxonomy
Topics: Reinforcement Learning in Robotics · Machine Learning and Algorithms
