Near-Optimal Sample Complexity for MDPs via Anchoring
Jongmin Lee, Mario Bravo, Roberto Cominetti

TL;DR
This paper introduces a new model-free algorithm for average reward MDPs that achieves near-optimal sample complexity without prior knowledge of the bias span, combining Halpern's anchored iteration with recursive sampling.
Contribution
It presents the first model-free algorithm with near-optimal sample complexity for average reward MDPs that does not require prior knowledge of the bias span.
Findings
Achieves sample complexity $\tilde{O}(|\text{S}||\text{A}|\,\text{sp}(h^*)^2/\varepsilon^2)$, matching the known lower bound up to a factor of $\text{sp}(h^*)$
Requires no prior knowledge of the bias span and guarantees finite termination
Extends techniques to discounted MDPs
Abstract
We study a new model-free algorithm to compute $\varepsilon$-optimal policies for average reward Markov decision processes, in the weakly communicating case. Given a generative model, our procedure combines a recursive sampling technique with Halpern's anchored iteration, and computes an $\varepsilon$-optimal policy with sample and time complexity $\tilde{O}(|\text{S}||\text{A}|\,\text{sp}(h^*)^2/\varepsilon^2)$ both in high probability and in expectation. To our knowledge, this is the best complexity among model-free algorithms, matching the known lower bound up to a factor $\text{sp}(h^*)$. Although the complexity bound involves the span seminorm of the unknown bias vector, the algorithm requires no prior knowledge and implements a stopping rule which guarantees with probability 1 that the procedure terminates in finite time. We also analyze…
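To make the idea of anchored iteration concrete, here is a minimal sketch of a Halpern-style anchored value iteration on a tiny, fully known MDP. It is not the paper's algorithm: the paper works with a generative model and recursive sampling, whereas this sketch assumes exact access to the transition probabilities. The two-state MDP, the function names (`bellman`, `span`, `anchored_value_iteration`), the anchoring weights $1/(k+2)$, and the residual-based stopping test are all illustrative assumptions. The span seminorm used here is $\text{sp}(v) = \max_s v(s) - \min_s v(s)$.

```python
def bellman(h, P, r):
    """Exact average-reward Bellman operator: (Th)(s) = max_a [r(s,a) + sum_t P(t|s,a) h(t)]."""
    return [
        max(
            r[s][a] + sum(p * h[t] for t, p in enumerate(P[s][a]))
            for a in range(len(r[s]))
        )
        for s in range(len(h))
    ]


def span(v):
    """Span seminorm: max entry minus min entry."""
    return max(v) - min(v)


def anchored_value_iteration(P, r, eps, max_iters=100_000):
    """Halpern-style iteration h <- beta * h0 + (1 - beta) * T(h), with anchor h0 = 0.

    Stops once the Bellman residual T(h) - h has span at most eps.
    The weights beta_k = 1 / (k + 2) are an assumed, commonly used choice.
    """
    n = len(r)
    h0 = [0.0] * n          # anchor point (assumption: start from zero)
    h = list(h0)
    for k in range(max_iters):
        Th = bellman(h, P, r)
        residual = [Th[s] - h[s] for s in range(n)]
        if span(residual) <= eps:
            # The optimal gain lies between min and max of the residual,
            # so the midpoint is within eps of it.
            gain = 0.5 * (max(residual) + min(residual))
            return h, gain, k
        beta = 1.0 / (k + 2)
        h = [beta * h0[s] + (1.0 - beta) * Th[s] for s in range(n)]
    raise RuntimeError("no convergence within max_iters")


# Hypothetical 2-state, 2-action MDP: P[s][a] is a distribution over next states.
P = [
    [[0.9, 0.1], [0.0, 1.0]],   # state 0
    [[0.5, 0.5], [0.0, 1.0]],   # state 1
]
r = [
    [1.0, 0.0],                 # rewards in state 0
    [0.0, 0.5],                 # rewards in state 1
]

h, gain, iters = anchored_value_iteration(P, r, eps=1e-2)
```

For this toy instance the optimal gain can be computed by hand as $5/6$ (always play action 0), so `gain` should land within `eps` of that value. The stopping test only needs the span of the residual, which is why no prior bound on $\text{sp}(h^*)$ enters the loop in this simplified setting.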
Taxonomy
Topics: Machine Learning and Algorithms · Water Systems and Optimization · Sparse and Compressive Sensing Techniques
