PAC Bounds for Discounted MDPs

Tor Lattimore; Marcus Hutter

arXiv:1202.3890 · cs.LG · May 17, 2013

TL;DR

This paper proves upper and lower bounds on the sample complexity of learning near-optimal behaviour in finite-state discounted MDPs. The upper bound assumes that each action leads to at most two possible next-states, and the two bounds match up to logarithmic factors.

Contribution

The paper proves a new bound on the number of time-steps at which a UCRL-style algorithm is not PAC, and a lower bound that is both tighter than previous work and more general, since it applies to all policies. The two bounds match up to logarithmic factors.

Findings

01 Upper bound on sample complexity for a UCRL-style algorithm, assuming each action leads to at most two possible next-states

02 Lower bound that applies to all policies and is tighter than previous work

03 Upper and lower bounds match up to logarithmic factors

Abstract

We study upper and lower bounds on the sample-complexity of learning near-optimal behaviour in finite-state discounted Markov Decision Processes (MDPs). For the upper bound we make the assumption that each action leads to at most two possible next-states and prove a new bound for a UCRL-style algorithm on the number of time-steps when it is not Probably Approximately Correct (PAC). The new lower bound strengthens previous work by being both more general (it applies to all policies) and tighter. The upper and lower bounds match up to logarithmic factors.
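To make the quantity being bounded concrete, here is a minimal Python sketch of "the number of time-steps when it is not PAC": the count of time-steps at which the agent's current policy has value more than ε below optimal in the current state. The toy MDP, the function names, and the simplification that the agent follows a stationary policy snapshot at each step are all illustrative assumptions, not taken from the paper, and no UCRL-style algorithm is implemented here.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10):
    """Optimal values. P: (S, A, S) transition probabilities, R: (S, A) rewards."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)          # (S, A) action values
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_value(P, R, gamma, policy):
    """Exact value of a stationary deterministic policy via a linear solve."""
    S = P.shape[0]
    idx = np.arange(S)
    P_pi = P[idx, policy]                # (S, S) transitions under the policy
    R_pi = R[idx, policy]                # (S,) rewards under the policy
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def count_mistakes(V_star, policy_values, states, eps):
    """Number of time-steps t with V^{pi_t}(s_t) < V*(s_t) - eps."""
    return sum(1 for V_pi, s in zip(policy_values, states)
               if V_pi[s] < V_star[s] - eps)

# Hypothetical 2-state, 2-action MDP; every action has at most two next-states.
gamma = 0.9
P = np.zeros((2, 2, 2))
P[0, 0] = [1.0, 0.0]   # state 0, action 0: stay in state 0
P[0, 1] = [0.5, 0.5]   # state 0, action 1: reach state 1 with probability 0.5
P[1, 0] = [0.0, 1.0]   # state 1 is absorbing under both actions
P[1, 1] = [0.0, 1.0]
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])  # reward 1 only in state 1

V_star = value_iteration(P, R, gamma)
V_bad = policy_value(P, R, gamma, np.array([0, 0]))   # never leaves state 0
V_good = policy_value(P, R, gamma, np.array([1, 0]))  # optimal in this toy MDP

# Assumed trajectory: the agent's policy snapshot and state at three time-steps.
mistakes = count_mistakes(V_star, [V_bad, V_bad, V_good], [0, 1, 0], eps=0.1)
print(mistakes)
```

In this sketch only the first time-step counts as a mistake: the bad policy is ε-suboptimal in state 0 but loses nothing in the absorbing state 1. A PAC bound of the kind studied in the paper limits, with high probability, how large such a count can grow over an unbounded run.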

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms