Online Episodic Convex Reinforcement Learning

Bianca Marin Moreno (Thoth; EDF R&D; FiME Lab); Khaled Eldowa (UNIMI; POLIMI); Pierre Gaillard (Thoth); Margaux Brégère (EDF R&D; LPSM); Nadia Oudjane (EDF R&D; FiME Lab)

arXiv:2505.07303·cs.LG·May 13, 2025

Open Access

TL;DR

This paper introduces new algorithms for online convex reinforcement learning in episodic finite-horizon MDPs. It achieves near-optimal regret bounds without prior knowledge of the transition function and extends the setting to bandit feedback, where only the value of the objective is observed.

Contribution

The paper presents the first algorithm with near-optimal regret for online CURL that needs no prior knowledge of the transition function, and the first treatment of a bandit version of CURL. The methods build on online mirror descent and bandit convex optimization techniques.

Findings

01

Achieved near-optimal regret bounds for online CURL without prior knowledge of the transition function.

02

Extended algorithms to bandit feedback with sub-linear regret.

03

Developed exploration strategies for convex loss functions in MDPs.

Abstract

We study online learning in episodic finite-horizon Markov decision processes (MDPs) with convex objective functions, known as the concave utility reinforcement learning (CURL) problem. This setting generalizes RL from linear to convex losses on the state-action distribution induced by the agent's policy. The non-linearity of CURL invalidates classical Bellman equations and requires new algorithmic approaches. We introduce the first algorithm achieving near-optimal regret bounds for online CURL without any prior knowledge on the transition function. To achieve this, we use an online mirror descent algorithm with varying constraint sets and a carefully designed exploration bonus. We then address for the first time a bandit version of CURL, where the only feedback is the value of the objective function on the state-action distribution induced by the agent's policy. We achieve a sub-linear…
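To make "convex losses on the state-action distribution induced by the agent's policy" concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the paper's algorithm: the names `occupancy`, `linear_loss`, and `convex_loss`, the array layouts, the random toy MDP, and the quadratic loss are all hypothetical choices made for this example. It computes the per-step state-action distribution of a policy in a small finite-horizon MDP, then evaluates a linear loss (the classical RL case) and a convex, non-linear loss on that distribution.

```python
import numpy as np


def occupancy(P, pi, mu0):
    """Per-step state-action distributions induced by a policy.

    P:   (N-1, X, A, X) array, P[n, x, a, y] = Pr(next state y | x, a) at step n
    pi:  (N, X, A) array, pi[n, x, a] = Pr(action a | state x) at step n
    mu0: (X,) initial state distribution
    Returns mu with shape (N, X, A); each mu[n] sums to 1.
    """
    N, X, A = pi.shape
    mu = np.zeros((N, X, A))
    state = mu0
    for n in range(N):
        # Joint distribution over (state, action) at step n.
        mu[n] = state[:, None] * pi[n]
        if n < N - 1:
            # Push the joint distribution through the transition kernel.
            state = np.einsum("xa,xay->y", mu[n], P[n])
    return mu


def linear_loss(mu, cost):
    """Classical RL objective: expected cumulative cost, linear in mu."""
    return float(np.sum(mu * cost))


def convex_loss(mu, target):
    """Illustrative convex objective (an assumption for this sketch):
    squared distance between mu and a target distribution."""
    return float(0.5 * np.sum((mu - target) ** 2))


# Toy instance with randomly drawn transitions and policies.
rng = np.random.default_rng(0)
X, A, N = 3, 2, 4
P = rng.dirichlet(np.ones(X), size=(N - 1, X, A))  # shape (N-1, X, A, X)
pi_a = rng.dirichlet(np.ones(A), size=(N, X))      # shape (N, X, A)
pi_b = rng.dirichlet(np.ones(A), size=(N, X))
mu0 = np.array([1.0, 0.0, 0.0])

mu_a = occupancy(P, pi_a, mu0)
mu_b = occupancy(P, pi_b, mu0)
cost = rng.uniform(size=(N, X, A))

print("linear loss of policy a:", linear_loss(mu_a, cost))
print("convex loss of policy a w.r.t. policy b's distribution:",
      convex_loss(mu_a, mu_b))
```

The linear loss decomposes into per-step expected costs, which is what Bellman-style recursions exploit. The quadratic loss depends on the whole distribution at once and has no such decomposition, which is the sense in which the abstract says non-linearity invalidates classical Bellman equations.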

Taxonomy

Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Distributed Control Multi-Agent Systems