No-Regret Reinforcement Learning in Smooth MDPs
Davide Maran, Alberto Maria Metelli, Matteo Papini, Marcello Restelli

TL;DR
This paper introduces a new smoothness assumption for MDPs and proposes two algorithms that achieve the best regret guarantees in reinforcement learning with continuous state and action spaces.
Contribution
The paper proposes a novel $\nu$-smoothness assumption for MDPs and develops two algorithms with improved regret guarantees for continuous RL problems.
Findings
Both algorithms achieve the best regret guarantees when compared with state-of-the-art results.
Legendre-Eleanor is more general but computationally inefficient.
Legendre-LSVI is computationally efficient but applies to a smaller class of problems.
Abstract
Obtaining no-regret guarantees for reinforcement learning (RL) in the case of problems with continuous state and/or action spaces is still one of the major open challenges in the field. Recently, a variety of solutions have been proposed, but besides very specific settings, the general problem remains unsolved. In this paper, we introduce a novel structural assumption on the Markov decision processes (MDPs), namely smoothness, that generalizes most of the settings proposed so far (e.g., linear MDPs and Lipschitz MDPs). To face this challenging scenario, we propose two algorithms for regret minimization in smooth MDPs. Both algorithms build upon the idea of constructing an MDP representation through an orthogonal feature map based on Legendre polynomials. The first algorithm, Legendre-Eleanor, achieves the no-regret property under weaker assumptions but is…
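As a rough illustration of what an orthogonal feature map based on Legendre polynomials can look like, here is a minimal sketch in Python. It is not the construction from the paper: the function name `legendre_features`, the rescaling of the state-action space to $[-1, 1]^d$, and the truncation by total degree are all assumptions made for this example.

```python
import itertools

import numpy as np
from numpy.polynomial import legendre


def legendre_features(z, degree):
    """Tensor-product Legendre features for points z in [-1, 1]^d.

    Hypothetical helper for illustration only. `z` has shape (n, d),
    where each row is a (state, action) vector rescaled to [-1, 1]^d.
    """
    z = np.atleast_2d(np.asarray(z, dtype=float))
    n, d = z.shape
    # One-dimensional Legendre values P_0..P_degree for every coordinate;
    # legvander returns an array of shape (n, d, degree + 1).
    vals = legendre.legvander(z, degree)
    # Rescale so each one-dimensional polynomial has unit L2 norm on [-1, 1]
    # (the integral of P_k^2 over [-1, 1] is 2 / (2k + 1)).
    vals = vals * np.sqrt((2 * np.arange(degree + 1) + 1) / 2.0)
    # Assumption: keep the multi-indices whose total degree is <= degree.
    multi_indices = [
        m for m in itertools.product(range(degree + 1), repeat=d)
        if sum(m) <= degree
    ]
    # Each feature is the product over coordinates of the chosen 1-D polynomial.
    columns = [
        np.prod(vals[:, np.arange(d), list(m)], axis=1) for m in multi_indices
    ]
    return np.stack(columns, axis=1)
```

Calling `legendre_features(points, degree)` on an array of rescaled state-action pairs returns one row of features per pair. Because the one-dimensional polynomials are normalized, the resulting features are orthonormal with respect to the uniform (Lebesgue) measure on $[-1, 1]^d$, which is the property that makes such a map "orthogonal".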
Taxonomy
Topics: Advanced Control Systems Optimization · Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research
