Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes
Ronan Fruit, Matteo Pirotta, Alessandro Lazaric

TL;DR
This paper introduces TUCRL, an algorithm that efficiently balances exploration and exploitation in any finite MDP without prior knowledge, achieving near-optimal regret bounds even in complex, weakly-communicating environments.
Contribution
The paper presents TUCRL, the first algorithm able to perform efficient exploration-exploitation in any finite MDP without requiring prior knowledge, with a proven regret bound and numerical simulations showing its advantages over state-of-the-art algorithms.
Findings
TUCRL achieves a regret bound of $\widetilde{O}(D^C \sqrt{\Gamma^C S^C A T})$ in any finite MDP.
Existing algorithms suffer linear regret in weakly-communicating MDPs or require prior knowledge.
Numerical simulations demonstrate TUCRL's effectiveness and advantages over state-of-the-art algorithms.
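To make the scaling in the first finding concrete, the sketch below evaluates only the leading term D^C * sqrt(Gamma^C * S^C * A * T) of the bound, with constants and logarithmic factors dropped. The function name and the sample values are illustrative assumptions, not quantities reported in the paper.

```python
import math

def regret_bound_leading_term(d_c, gamma_c, s_c, n_actions, horizon):
    """Leading term D^C * sqrt(Gamma^C * S^C * A * T); constants and log factors omitted."""
    return d_c * math.sqrt(gamma_c * s_c * n_actions * horizon)

# Hypothetical problem sizes, chosen only for illustration.
for horizon in (10**4, 10**6, 10**8):
    bound = regret_bound_leading_term(2.0, 2, 2, 2, horizon)
    # The bound grows like sqrt(T), so the per-step value shrinks as T grows.
    print(horizon, bound, bound / horizon)
```

Because the term grows with the square root of T, the per-step value tends to zero, whereas an algorithm with linear regret keeps a constant per-step loss.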
Abstract
While designing the state space of an MDP, it is common to include states that are transient or not reachable by any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable). This leads to defining weakly-communicating or multi-chain MDPs. In this paper, we introduce TUCRL, the first algorithm able to perform efficient exploration-exploitation in any finite Markov Decision Process (MDP) without requiring any form of prior knowledge. In particular, for any MDP with $S^C$ communicating states, $A$ actions and $\Gamma^C \leq S^C$ possible communicating next states, we derive a $\widetilde{O}(D^C \sqrt{\Gamma^C S^C A T})$ regret bound, where $D^C$ is the diameter (i.e., the longest shortest path) of the communicating part of the MDP. This is…
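The diameter of the communicating part can be illustrated with a small example. The sketch below is not taken from the paper: it assumes the transition model is known, assumes the set of communicating states is given, and uses plain value iteration on expected hitting times. The function names and the toy MDP are hypothetical.

```python
def min_expected_hitting_times(P, target, tol=1e-10, max_iter=100_000):
    """Minimum over policies of the expected number of steps to reach `target`.

    P[s][a][s2] is the probability of moving from s to s2 under action a.
    """
    n = len(P)
    h = [0.0] * n
    for _ in range(max_iter):
        # Unit-cost shortest-path backup: one step plus the best expected remainder.
        h_new = [
            0.0 if s == target else
            1.0 + min(sum(p * h[s2] for s2, p in enumerate(row)) for row in P[s])
            for s in range(n)
        ]
        if max(abs(x - y) for x, y in zip(h_new, h)) < tol:
            return h_new
        h = h_new
    return h  # not converged: the target may be unreachable from some state


def communicating_diameter(P, comm_states):
    """Largest minimum expected hitting time between states in `comm_states`."""
    # Restrict the model to the communicating states (assumed closed under all policies).
    P_c = [[[row[s2] for s2 in comm_states] for row in P[s]] for s in comm_states]
    return max(max(min_expected_hitting_times(P_c, t)) for t in range(len(P_c)))


# Toy MDP: states 0 and 1 communicate; state 2 is never reachable from them.
P = [
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]],  # state 0: stay, or try to move to 1
    [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]],  # state 1: stay, or move to 0
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],  # state 2: always falls into state 0
]
d_c = communicating_diameter(P, [0, 1])
```

In this toy model the move from state 0 to state 1 succeeds with probability 0.5, so it takes 2 steps in expectation and `d_c` is approximately 2. The diameter over all three states would be infinite, since no policy reaches state 2 from the others.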
Taxonomy
Topics: Reinforcement Learning in Robotics · Age of Information Optimization · Advanced Bandit Algorithms Research
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
