Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes

Ronan Fruit; Matteo Pirotta; Alessandro Lazaric

arXiv:1807.02373·cs.LG·March 21, 2019·58 cites

TL;DR

This paper introduces TUCRL, an algorithm that efficiently balances exploration and exploitation in any finite MDP without prior knowledge, achieving near-optimal regret bounds even in complex, weakly-communicating environments.

Contribution

The paper presents TUCRL, the first algorithm able to perform efficient exploration-exploitation in any finite MDP without requiring prior knowledge. It comes with a proven regret bound and, in numerical simulations, shows advantages over existing methods.

Findings

01

TUCRL achieves a regret bound of $\widetilde{O}(D^{C} \sqrt{\Gamma^{C} S^{C} A T})$ in any finite MDP.

02

Existing algorithms suffer linear regret in weakly-communicating MDPs or require prior knowledge.

03

Numerical simulations demonstrate TUCRL's effectiveness and advantages over state-of-the-art algorithms.

Abstract

While designing the state space of an MDP, it is common to include states that are transient or not reachable by any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable). This leads to defining weakly-communicating or multi-chain MDPs. In this paper, we introduce TUCRL, the first algorithm able to perform efficient exploration-exploitation in any finite Markov Decision Process (MDP) without requiring any form of prior knowledge. In particular, for any MDP with $S^{C}$ communicating states, $A$ actions and $\Gamma^{C} \leq S^{C}$ possible communicating next states, we derive a $\widetilde{O}(D^{C} \sqrt{\Gamma^{C} S^{C} A T})$ regret bound, where $D^{C}$ is the diameter (i.e., the longest shortest path) of the communicating part of the MDP. This is…
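To make the quantities in the abstract concrete, here is a minimal sketch that finds the communicating part of a toy MDP and its diameter. It assumes deterministic transitions, so that "longest shortest path" reduces to plain breadth-first search; in a stochastic MDP the diameter is defined through expected hitting times and needs more machinery. The names `next_state`, `bfs_distances`, `communicating_part` and `diameter` are hypothetical and do not come from the paper or its repository.

```python
from collections import deque

# Toy deterministic MDP (an assumption for illustration): next_state[s][a] = s'.
# States 0, 1, 2 communicate; state 3 is transient; state 4 is unreachable from 0.
next_state = {
    0: {"a": 1, "b": 0},
    1: {"a": 2, "b": 0},
    2: {"a": 0, "b": 2},
    3: {"a": 0, "b": 3},
    4: {"a": 4, "b": 4},
}

def bfs_distances(next_state, source):
    """Shortest-path lengths (in steps) from `source` to every reachable state."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        s = queue.popleft()
        for s_next in next_state[s].values():
            if s_next not in dist:
                dist[s_next] = dist[s] + 1
                queue.append(s_next)
    return dist

def communicating_part(next_state, start):
    """States reachable from `start` that can also reach `start` again."""
    reachable = bfs_distances(next_state, start)
    return {s for s in reachable if start in bfs_distances(next_state, s)}

def diameter(next_state, states):
    """Longest shortest path between two distinct states of `states`."""
    return max(
        (bfs_distances(next_state, s)[t] for s in states for t in states if s != t),
        default=0,
    )

communicating = communicating_part(next_state, start=0)
print(sorted(communicating))                 # [0, 1, 2]
print(diameter(next_state, communicating))   # 2
```

In this toy example $S^{C} = 3$ and $D^{C} = 2$, even though the full state space has five states. The diameter of the whole MDP would be infinite because state 4 can never be reached from state 0.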

Code & Models

Repositories

RonanFR/UCRL (Official)

Taxonomy

Topics: Reinforcement Learning in Robotics · Age of Information Optimization · Advanced Bandit Algorithms Research

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings