Online Episodic Convex Reinforcement Learning
Bianca Marin Moreno (Thoth, EDF R&D, FiME Lab), Khaled Eldowa (UNIMI, POLIMI), Pierre Gaillard (Thoth), Margaux Brégère (EDF R&D, LPSM), Nadia Oudjane (EDF R&D, FiME Lab)

TL;DR
This paper introduces new algorithms for online concave utility reinforcement learning (CURL) in episodic finite-horizon MDPs. They achieve near-optimal regret bounds without prior knowledge of the transition function, and the approach extends to a bandit-feedback setting with sub-linear regret.
Contribution
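The paper presents the first algorithm with near-optimal regret bounds for online CURL that needs no prior knowledge of the transition function, based on online mirror descent with varying constraint sets and an exploration bonus. It also gives the first treatment of a bandit version of CURL, using bandit convex optimization techniques.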
Findings
Achieved near-optimal regret bounds for online CURL.
Extended algorithms to bandit feedback with sub-linear regret.
Developed exploration strategies for convex loss functions in MDPs.
Abstract
We study online learning in episodic finite-horizon Markov decision processes (MDPs) with convex objective functions, known as the concave utility reinforcement learning (CURL) problem. This setting generalizes RL from linear to convex losses on the state-action distribution induced by the agent's policy. The non-linearity of CURL invalidates classical Bellman equations and requires new algorithmic approaches. We introduce the first algorithm achieving near-optimal regret bounds for online CURL without any prior knowledge on the transition function. To achieve this, we use an online mirror descent algorithm with varying constraint sets and a carefully designed exploration bonus. We then address for the first time a bandit version of CURL, where the only feedback is the value of the objective function on the state-action distribution induced by the agent's policy. We achieve a sub-linear…
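To make the abstract's central object concrete, the following is a minimal sketch, not the paper's algorithm, of the state-action distribution induced by a policy in a small finite-horizon MDP, together with one linear and one convex objective evaluated on it. The toy MDP (`P`, `mu0`), the time-homogeneous transitions, the two policies, and the squared-distance objective are all illustrative assumptions that do not come from the paper.

```python
# Illustrative sketch only: toy MDP, policies, and objectives are assumptions.

def state_action_distribution(P, policy, mu0):
    """Return d with d[h][s][a] = Pr(state s and action a at step h).

    P[s][a][s2] : transition probabilities (assumed time-homogeneous here)
    policy[h][s][a] : probability of action a in state s at step h
    mu0[s] : initial state distribution
    """
    n_states, n_actions = len(mu0), len(policy[0][0])
    d, state_dist = [], list(mu0)
    for pi_h in policy:
        d_h = [[state_dist[s] * pi_h[s][a] for a in range(n_actions)]
               for s in range(n_states)]
        d.append(d_h)
        # Push the state-action mass through the transition kernel.
        state_dist = [sum(d_h[s][a] * P[s][a][s2]
                          for s in range(n_states) for a in range(n_actions))
                      for s2 in range(n_states)]
    return d

def linear_loss(d, cost):
    """Classical RL objective: expected cumulative cost, linear in d."""
    return sum(d_h[s][a] * cost[s][a]
               for d_h in d for s in range(len(cost)) for a in range(len(cost[0])))

def squared_distance_loss(d, target):
    """A convex, non-linear objective: squared distance between the state
    marginal at each step and a target state distribution."""
    return sum((sum(d_h[s]) - target[s]) ** 2
               for d_h in d for s in range(len(target)))

# Toy instance: 2 states, 2 actions, horizon 3.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
mu0 = [1.0, 0.0]
H = 3
policy_a = [[[1.0, 0.0], [1.0, 0.0]] for _ in range(H)]  # always action 0
policy_b = [[[0.0, 1.0], [0.0, 1.0]] for _ in range(H)]  # always action 1

d_a = state_action_distribution(P, policy_a, mu0)
d_b = state_action_distribution(P, policy_b, mu0)
cost = [[1.0, 0.0], [0.0, 2.0]]
target = [0.5, 0.5]

print(linear_loss(d_a, cost), linear_loss(d_b, cost))
print(squared_distance_loss(d_a, target), squared_distance_loss(d_b, target))
```

The linear loss is the special case covered by classical RL, while an objective such as the squared distance depends on the whole distribution and cannot be written as an expected per-step cost, which is why the abstract notes that classical Bellman equations no longer apply.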
Taxonomy
Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Distributed Control Multi-Agent Systems
