Horizon-Free and Variance-Dependent Reinforcement Learning for Latent Markov Decision Processes
Runlong Zhou, Ruosong Wang, Simon S. Du

TL;DR
This paper introduces a model-based reinforcement learning framework for Latent Markov Decision Processes (LMDPs) that achieves a nearly horizon-free, variance-dependent regret bound, together with a new lower bound used to establish minimax optimality.
Contribution
The paper presents the first nearly horizon-free regret bounds for LMDPs, along with a variance-dependent analysis and a new lower bound demonstrating minimax optimality.
Findings
Achieves an $\tilde{O}(\sqrt{\text{Var}^\star M \Gamma S A K})$ regret bound.
First problem-dependent regret bound for LMDPs.
Provides a novel $\Omega(\sqrt{\text{Var}^\star M S A K})$ lower bound.
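As a quick reading aid, the upper and lower bounds listed above can be placed side by side. This is only a sketch derived from the two stated rates, with logarithmic factors and constants suppressed; it is not a statement taken from the paper's proofs.

```latex
\[
\underbrace{\tilde{O}\!\left(\sqrt{\text{Var}^\star M \Gamma S A K}\right)}_{\text{upper bound}}
\quad\text{vs.}\quad
\underbrace{\Omega\!\left(\sqrt{\text{Var}^\star M S A K}\right)}_{\text{lower bound}},
\qquad
\frac{\sqrt{\text{Var}^\star M \Gamma S A K}}{\sqrt{\text{Var}^\star M S A K}} = \sqrt{\Gamma}.
\]
```

Read this way, the two rates differ by a factor of $\sqrt{\Gamma}$ up to logarithmic terms, so they coincide in order whenever the maximum transition degree $\Gamma$ is bounded by a constant.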
Abstract
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight. We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver. We prove an $\tilde{O}(\sqrt{\text{Var}^\star M \Gamma S A K})$ regret bound where $\tilde{O}$ hides logarithm factors, $M$ is the number of contexts, $S$ is the number of states, $A$ is the number of actions, $K$ is the number of episodes, $\Gamma$ is the maximum transition degree of any state-action pair, and $\text{Var}^\star$ is a variance quantity describing the determinism of the LMDP. The regret bound only scales logarithmically with the planning horizon, thus yielding the first (nearly) horizon-free regret bound for LMDP. This is also the first problem-dependent regret bound for LMDP. Key in our proof is…
Taxonomy
Topics: Reinforcement Learning in Robotics
