Learning Upper Lower Value Envelopes to Shape Online RL: A Principled Approach

Sebastian Reboul; Hélène Halconruy; Randal Douc

arXiv:2510.19528 · stat.ML · October 23, 2025

TL;DR

This paper presents a new two-stage framework for offline-to-online reinforcement learning that learns data-driven value envelopes to improve regret bounds and accelerate online adaptation.

Contribution

The paper introduces a principled method to learn and incorporate value envelopes in online RL, extending prior work with decoupled bounds and a formal regret analysis.

Findings

01. Substantial regret reduction in tabular MDPs

02. Data-driven value envelopes improve online RL performance

03. Theoretical regret bounds linked to offline data quality

Abstract

We investigate the fundamental problem of leveraging offline data to accelerate online reinforcement learning - a direction with strong potential but limited theoretical grounding. Our study centers on how to learn and apply value envelopes within this context. To this end, we introduce a principled two-stage framework: the first stage uses offline data to derive upper and lower bounds on value functions, while the second incorporates these learned bounds into online algorithms. Our method extends prior work by decoupling the upper and lower bounds, enabling more flexible and tighter approximations. In contrast to approaches that rely on fixed shaping functions, our envelopes are data-driven and explicitly modeled as random variables, with a filtration argument ensuring independence across phases. The analysis establishes high-probability regret bounds determined by two interpretable…
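The second stage described above, feeding learned bounds into an online algorithm, can be illustrated with a minimal tabular sketch. This is not the paper's algorithm: it assumes a finite-horizon MDP, a generic optimistic backward induction with an exploration bonus, and envelopes `lower` and `upper` that were already produced by some offline stage. The function name `envelope_backup` and all argument names are hypothetical.

```python
import numpy as np

def envelope_backup(P_hat, R_hat, bonus, lower, upper):
    """Optimistic backward induction clipped to a value envelope (illustrative only).

    P_hat: (S, A, S) estimated transition probabilities
    R_hat: (S, A)    estimated rewards
    bonus: (S, A)    exploration bonus (assumed given)
    lower, upper: (H + 1, S) assumed bounds on the optimal value at each step
    """
    H = lower.shape[0] - 1
    S, A = R_hat.shape
    V = np.zeros((H + 1, S))
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        # Standard optimistic backup: reward + bonus + expected next value.
        Q[h] = R_hat + bonus + P_hat @ V[h + 1]
        # An upper bound on the state value also bounds every action value.
        Q[h] = np.minimum(Q[h], upper[h][:, None])
        # The lower bound applies to the state value only, not to each action.
        V[h] = np.clip(Q[h].max(axis=1), lower[h], upper[h])
    return Q, V
```

The upper and lower envelopes enter separately here, which is one simple way to reflect the decoupling the abstract mentions: a tight `upper` shrinks the optimism of the bonus, while `lower` keeps the value estimate from dropping below what the offline data already supports.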

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques