Best Policy Identification in Linear MDPs

Jerome Taupin; Yassir Jedra; Alexandre Proutiere

arXiv:2208.05633·cs.LG·August 12, 2022

TL;DR

This paper studies how to identify an ε-optimal policy in discounted linear Markov Decision Processes (MDPs) from as few samples as possible under a generative model. It derives theoretical lower and upper bounds on the sample complexity and extends the algorithms to episodic settings.

Contribution

It derives an instance-specific lower bound and proposes near-optimal algorithms with proven sample complexity bounds for linear MDPs.

Findings

01

Sample complexity upper bound of $\frac{d}{(\text{gap})^2}$ times logarithmic factors.

02

One of the proposed algorithms matches existing lower bounds in the moderate-confidence regime.

03

Extension of algorithms to episodic linear MDPs.

Abstract

We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed-confidence setting under a generative model. We first derive an instance-specific lower bound on the expected number of samples required to identify an $\varepsilon$-optimal policy with probability $1 - \delta$. The lower bound characterizes the optimal sampling rule as the solution of an intricate non-convex optimization program, but can be used as the starting point to devise simple and near-optimal sampling rules and algorithms. We devise such algorithms. One of these exhibits a sample complexity upper bounded by $O\left(\frac{d}{(\varepsilon + \Delta)^{2}}\left(\log\left(\frac{1}{\delta}\right) + d\right)\right)$, where $\Delta$ denotes the minimum reward gap of sub-optimal actions and $d$ is the dimension of the feature space. This upper bound holds in the moderate-confidence regime (i.e., for all…
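To get a feel for how the upper bound $O\left(\frac{d}{(\varepsilon + \Delta)^{2}}\left(\log\left(\frac{1}{\delta}\right) + d\right)\right)$ scales, the expression inside the $O(\cdot)$ can be evaluated numerically. The sketch below uses a hypothetical helper, `bpi_sample_complexity_scale`, which is not code from the paper. It drops the unspecified constant hidden by the $O(\cdot)$, so its output is only meaningful for comparing instances (for example, how the bound grows as $\varepsilon + \Delta$ shrinks or $d$ grows), not as an actual number of samples.

```python
import math


def bpi_sample_complexity_scale(d: int, eps: float, gap: float, delta: float) -> float:
    """Evaluate d / (eps + gap)^2 * (log(1/delta) + d).

    Hypothetical helper: this is the expression inside the O(.) of the
    upper bound, without the hidden constant, so it only supports
    relative comparisons between instances.
    """
    if d <= 0:
        raise ValueError("d (feature dimension) must be a positive integer")
    if eps < 0 or gap < 0 or eps + gap == 0:
        raise ValueError("need eps >= 0, gap >= 0 and eps + gap > 0")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    # Gap-dependent factor d / (eps + gap)^2 times the confidence/dimension term.
    return d / (eps + gap) ** 2 * (math.log(1.0 / delta) + d)


if __name__ == "__main__":
    base = bpi_sample_complexity_scale(d=10, eps=0.1, gap=0.1, delta=0.05)
    finer = bpi_sample_complexity_scale(d=10, eps=0.05, gap=0.05, delta=0.05)
    # Halving eps + gap quadruples the expression.
    print(finer / base)
```

Halving $\varepsilon + \Delta$ multiplies the value by 4, and whenever $\delta \ge e^{-d}$ the additive $d$ term inside the parentheses is at least as large as $\log(1/\delta)$, so the dimension rather than the confidence level drives the second factor.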

Taxonomy

Topics: Reinforcement Learning in Robotics · Machine Learning and Algorithms