Optimistic PAC Reinforcement Learning: the Instance-Dependent View
Andrea Tirinzoni, Aymen Al-Marjani, Emilie Kaufmann

TL;DR
This paper gives the first instance-dependent sample complexity bound for an optimistic PAC RL algorithm, BPI-UCRL. It shows that the algorithm is near-optimal in deterministic MDPs and explains why the complexity of PAC RL differs from that of regret minimization.
Contribution
The paper provides the first instance-dependent analysis of an optimistic PAC RL algorithm and introduces a simple new analysis technique called the "target trick."
Findings
BPI-UCRL achieves near-optimal sample complexity in deterministic MDPs.
The analysis introduces a refined notion of sub-optimality gap.
A hardness result explains the complexity gap between PAC RL and regret minimization.
Abstract
Optimistic algorithms have been extensively studied for regret minimization in episodic tabular MDPs, both from a minimax and an instance-dependent view. However, for the PAC RL problem, where the goal is to identify a near-optimal policy with high probability, little is known about their instance-dependent sample complexity. A negative result of Wagenmaker et al. (2021) suggests that optimistic sampling rules cannot be used to attain the (still elusive) optimal instance-dependent sample complexity. On the positive side, we provide the first instance-dependent bound for an optimistic algorithm for PAC RL, BPI-UCRL, for which only minimax guarantees were available (Kaufmann et al., 2021). While our bound features some minimal visitation probabilities, it also features a refined notion of sub-optimality gap compared to the value gaps that appear in prior work. Moreover, in MDPs with…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics
