Learning Upper-Lower Value Envelopes to Shape Online RL: A Principled Approach
Sebastian Reboul, Hélène Halconruy, Randal Douc

TL;DR
This paper presents a new two-stage framework for offline-to-online reinforcement learning that learns data-driven value envelopes to improve regret bounds and accelerate online adaptation.
Contribution
The paper introduces a principled method to learn and incorporate value envelopes in online RL, extending prior work with decoupled bounds and a formal regret analysis.
Findings
Substantial regret reduction in tabular MDPs
Data-driven value envelopes improve online RL performance
Theoretical regret bounds linked to offline data quality
Abstract
We investigate the fundamental problem of leveraging offline data to accelerate online reinforcement learning, a direction with strong potential but limited theoretical grounding. Our study centers on how to learn and apply value envelopes within this context. To this end, we introduce a principled two-stage framework: the first stage uses offline data to derive upper and lower bounds on value functions, while the second incorporates these learned bounds into online algorithms. Our method extends prior work by decoupling the upper and lower bounds, enabling more flexible and tighter approximations. In contrast to approaches that rely on fixed shaping functions, our envelopes are data-driven and explicitly modeled as random variables, with a filtration argument ensuring independence across phases. The analysis establishes high-probability regret bounds determined by two interpretable…
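As a rough illustration of the second stage described in the abstract, the sketch below shows one way precomputed envelopes could shape an optimistic planner in a finite-horizon tabular MDP: each optimistic value estimate is clipped into the interval given by the lower and upper envelopes. This is not the authors' algorithm. The function name `clipped_optimistic_backup`, the array layout, and the generic `bonus` term are all assumptions made for the example, and the envelopes are taken as given rather than learned from offline data.

```python
import numpy as np


def clipped_optimistic_backup(P_hat, R_hat, bonus, lower, upper):
    """One backward pass of optimistic value iteration with envelope clipping.

    Hypothetical helper, not taken from the paper.

    P_hat : (S, A, S) estimated transition probabilities
    R_hat : (S, A)    estimated mean rewards
    bonus : (S, A)    exploration bonus (any optimism-style bonus)
    lower : (H + 1, S) lower value envelope, assumed precomputed offline
    upper : (H + 1, S) upper value envelope, assumed precomputed offline
    """
    H = lower.shape[0] - 1
    S, A = R_hat.shape
    V = np.zeros((H + 1, S))  # V[H] stays 0: no reward after the horizon
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        # Optimistic Bellman backup: reward + bonus + expected next-step value.
        Q[h] = R_hat + bonus + P_hat @ V[h + 1]
        # Shape the estimate: keep it inside the [lower, upper] envelope.
        V[h] = np.clip(Q[h].max(axis=1), lower[h], upper[h])
    policy = Q.argmax(axis=2)  # greedy action per (step, state)
    return Q, V, policy
```

When the upper envelope is tighter than the bonus-inflated estimate, the clipped value is smaller than the unshaped one. That is the intuition behind using envelopes to reduce over-exploration, although the actual regret analysis is in the paper and is not reproduced here.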
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques
