Bi-Level Offline Policy Optimization with Limited Exploration
Wenzhuo Zhou

TL;DR
This paper introduces a bi-level offline reinforcement learning algorithm that handles distributional shift and limited exploration by modeling a hierarchical interaction between the policy (upper level) and the value function (lower level), with regret guarantees under realizability that do not require data coverage assumptions.
Contribution
It proposes a novel bi-level structured policy optimization framework that handles distribution mismatch without relying on data coverage assumptions.
Findings
Achieves competitive performance on synthetic, benchmark, and real-world datasets.
Provides regret guarantees under realizability without data coverage assumptions.
Demonstrates effective control of uncertainty and distributional shift in offline RL.
Abstract
We study offline reinforcement learning (RL) which seeks to learn a good policy based on a fixed, pre-collected dataset. A fundamental challenge behind this task is the distributional shift due to the dataset lacking sufficient exploration, especially under function approximation. To tackle this issue, we propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper-level) and the value function (lower-level). The lower level focuses on constructing a confidence set of value estimates that maintain sufficiently small weighted average Bellman errors, while controlling uncertainty arising from distribution mismatch. Subsequently, at the upper level, the policy aims to maximize a conservative value estimate from the confidence set formed at the lower level. This novel formulation preserves the maximum flexibility of the…
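The bi-level structure described in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: the two-state MDP, the dataset, the finite grids of candidate Q-functions, the indicator weight functions, the tolerance `EPS`, and all function names below are hypothetical choices made only for illustration. The lower level keeps every candidate Q-function whose weighted average Bellman error on the data stays within `EPS`; the upper level picks the policy whose worst-case value over that set is largest.

```python
from itertools import product

GAMMA = 0.9
EPS = 0.06  # hypothetical tolerance defining the confidence set
STATES, ACTIONS = (0, 1), (0, 1)
PAIRS = list(product(STATES, ACTIONS))

# Hypothetical offline dataset of (s, a, r, s_next) transitions.
# The pair (s=1, a=0) is never observed, mimicking limited exploration.
DATA = [(0, 0, 0.0, 0), (0, 1, 1.0, 1), (1, 1, 1.0, 1), (0, 1, 1.0, 1)]

# Finite candidate classes (assumptions of this sketch, not of the paper).
Q_CLASS = [dict(zip(PAIRS, vals)) for vals in product((0.0, 1.0, 9.0, 10.0), repeat=4)]
POLICIES = [dict(zip(STATES, acts)) for acts in product(ACTIONS, repeat=2)]
W_CLASS = [{p: 1.0 for p in PAIRS}] + [
    {p: 1.0 if p == target else 0.0 for p in PAIRS} for target in PAIRS
]


def weighted_bellman_error(q, pi, w, data):
    """Absolute weighted average of one-step Bellman residuals of q under pi."""
    total = 0.0
    for s, a, r, s_next in data:
        residual = q[(s, a)] - r - GAMMA * q[(s_next, pi[s_next])]
        total += w[(s, a)] * residual
    return abs(total / len(data))


def confidence_set(pi, data):
    """Lower level: candidates whose error is small for every weight function."""
    return [
        q for q in Q_CLASS
        if max(weighted_bellman_error(q, pi, w, data) for w in W_CLASS) <= EPS
    ]


def conservative_value(pi, data, s0=0):
    """Worst-case value of pi at the start state s0 over the confidence set."""
    members = confidence_set(pi, data)
    if not members:
        return float("-inf")  # no candidate is consistent with the data
    return min(q[(s0, pi[s0])] for q in members)


def bilevel_select(data):
    """Upper level: policy maximizing the conservative value estimate."""
    return max(POLICIES, key=lambda pi: conservative_value(pi, data))


best_policy = bilevel_select(DATA)
print(best_policy, conservative_value(best_policy, DATA))
```

In this toy setting, a policy that visits the unobserved pair leaves the corresponding Q-value unconstrained by the data, so its confidence set contains both optimistic and pessimistic candidates and the minimum over the set penalizes it. A policy that stays on observed pairs is pinned down by the Bellman constraints and keeps a high conservative value.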
Taxonomy
Topics: Machine Learning and Data Classification · Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research
