PAC Bounds for Discounted MDPs
Tor Lattimore, Marcus Hutter

TL;DR
This paper proves upper and lower bounds on the sample complexity of learning near-optimal behaviour in finite-state discounted MDPs. The upper bound assumes that each action leads to at most two possible next-states, and the two bounds match up to logarithmic factors.
Contribution
It proves a new PAC bound for a UCRL-style algorithm and a lower bound that is both more general (it applies to all policies) and tighter than previous work. The upper and lower bounds match up to logarithmic factors.
Findings
Upper bound on sample complexity for a UCRL-style algorithm, assuming each action leads to at most two possible next-states
Lower bound applicable to all policies, tighter than previous work
Bounds match up to logarithmic factors
Abstract
We study upper and lower bounds on the sample-complexity of learning near-optimal behaviour in finite-state discounted Markov Decision Processes (MDPs). For the upper bound we make the assumption that each action leads to at most two possible next-states and prove a new bound for a UCRL-style algorithm on the number of time-steps when it is not Probably Approximately Correct (PAC). The new lower bound strengthens previous work by being both more general (it applies to all policies) and tighter. The upper and lower bounds match up to logarithmic factors.
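To make the bounded quantity concrete, here is a minimal sketch of counting the time-steps at which a learner is not near-optimal. It is not the paper's algorithm or analysis. The toy MDP, the function names, and the simplification of treating the learner's behaviour at each step as a fixed stationary policy are all assumptions made for illustration. Each action in the toy MDP leads to at most two possible next-states, mirroring the assumption used for the upper bound.

```python
# Hypothetical toy example: count the steps where the policy in use is not
# eps-optimal from the current state (a simplified view of sample complexity).

def optimal_value(P, R, gamma, iters=2000):
    """Value iteration. P[s][a] is a list of (next_state, probability) pairs."""
    V = [0.0] * len(P)
    for _ in range(iters):
        V = [
            max(
                R[s][a] + gamma * sum(p * V[n] for n, p in P[s][a])
                for a in range(len(P[s]))
            )
            for s in range(len(P))
        ]
    return V

def policy_value(P, R, policy, gamma, iters=2000):
    """Iterative evaluation of a deterministic stationary policy (state -> action)."""
    V = [0.0] * len(P)
    for _ in range(iters):
        V = [
            R[s][policy[s]] + gamma * sum(p * V[n] for n, p in P[s][policy[s]])
            for s in range(len(P))
        ]
    return V

def count_mistakes(P, R, gamma, eps, trajectory):
    """Count steps (state, policy) with V^policy(state) < V*(state) - eps."""
    v_star = optimal_value(P, R, gamma)
    return sum(
        1
        for s, pi in trajectory
        if policy_value(P, R, pi, gamma)[s] < v_star[s] - eps
    )

# Two states, two actions; every action has at most two possible next-states.
P = [
    [[(0, 1.0)], [(0, 0.5), (1, 0.5)]],  # state 0: stay, or try to reach state 1
    [[(1, 1.0)], [(1, 1.0)]],            # state 1: absorbing
]
R = [[0.0, 0.0], [1.0, 1.0]]             # reward only in state 1
GAMMA, EPS = 0.9, 0.1

bad_policy = [0, 0]   # never leaves state 0
good_policy = [1, 0]  # tries to reach the rewarding state

trajectory = [(0, bad_policy), (0, bad_policy), (0, good_policy),
              (1, good_policy), (1, bad_policy)]
mistakes = count_mistakes(P, R, GAMMA, EPS, trajectory)
```

In this sketch the bad policy counts as a mistake only while the learner is in state 0. Once in the absorbing state 1, both policies achieve the optimal value. A PAC bound of the kind studied here limits, with high probability, how large such a count can grow.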
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
