Bi-Level Offline Policy Optimization with Limited Exploration

Wenzhuo Zhou

arXiv:2310.06268 · cs.LG · October 11, 2023 · 1 citation

TL;DR

This paper introduces a bi-level offline reinforcement learning algorithm that manages distributional shift and limited exploration by modeling a hierarchical interaction between the policy (upper level) and the value function (lower level), with regret guarantees that do not rely on data coverage assumptions.

Contribution

The paper proposes a bi-level structured policy optimization framework that handles distribution mismatch without relying on data coverage assumptions.

Findings

01. Achieves competitive performance on synthetic, benchmark, and real-world datasets.

02. Provides regret guarantees under realizability without data coverage assumptions.

03. Demonstrates effective control of uncertainty and distributional shift in offline RL.

Abstract

We study offline reinforcement learning (RL), which seeks to learn a good policy from a fixed, pre-collected dataset. A fundamental challenge in this task is the distributional shift that arises when the dataset lacks sufficient exploration, especially under function approximation. To tackle this issue, we propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper level) and the value function (lower level). The lower level focuses on constructing a confidence set of value estimates that maintain sufficiently small weighted average Bellman errors, while controlling uncertainty arising from distribution mismatch. Subsequently, at the upper level, the policy aims to maximize a conservative value estimate from the confidence set formed at the lower level. This novel formulation preserves the maximum flexibility of the…
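The two levels described in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: it assumes finite, hand-picked candidate classes and brute-force enumeration, and every name in it (`Q_CLASS`, `POLICIES`, `WEIGHTS`, `EPS`, the one-state dataset) is hypothetical. It only shows the shape of the idea: a lower level that keeps value functions whose weighted average Bellman error on the data is small, and an upper level that picks the policy with the best worst-case value in that set.

```python
from itertools import product

GAMMA = 0.5   # discount factor (assumed for this toy problem)
EPS = 0.1     # hypothetical tolerance on the weighted average Bellman error

# Offline data: (state, action, reward, next_state).
# Only action 0 is ever observed, so the data has limited exploration.
DATA = [(0, 0, 1.0, 0)] * 20

STATES, ACTIONS = [0], [0, 1]
S0 = 0  # initial state at which policies are evaluated

# Finite candidate classes, chosen purely for illustration.
GRID = [0.0, 1.0, 2.0, 3.0, 4.0]
Q_CLASS = [
    dict(zip(product(STATES, ACTIONS), vals))
    for vals in product(GRID, repeat=len(STATES) * len(ACTIONS))
]
POLICIES = [dict(zip(STATES, acts)) for acts in product(ACTIONS, repeat=len(STATES))]
WEIGHTS = [
    lambda s, a: 1.0,            # uniform weight
    lambda s, a: float(a == 0),  # mass on action 0
    lambda s, a: float(a == 1),  # mass on action 1
]


def weighted_bellman_error(q, pi, w):
    """Absolute weighted average of the Bellman residual of q under pi on DATA."""
    total = sum(
        w(s, a) * (q[(s, a)] - r - GAMMA * q[(s2, pi[s2])])
        for s, a, r, s2 in DATA
    )
    return abs(total / len(DATA))


def conservative_value(pi):
    """Lower level: build the confidence set for pi, return its smallest value at S0."""
    conf_set = [
        q for q in Q_CLASS
        if max(weighted_bellman_error(q, pi, w) for w in WEIGHTS) <= EPS
    ]
    if not conf_set:
        return float("-inf")  # no candidate is consistent with the data
    return min(q[(S0, pi[S0])] for q in conf_set)


def bilevel_policy_search():
    """Upper level: choose the policy that maximizes the conservative value."""
    best = max(POLICIES, key=conservative_value)
    return best, conservative_value(best)
```

On this dataset the search returns the policy that always takes action 0, with conservative value 2.0. The policy that takes the unobserved action 1 is not ruled out by the data, but its confidence set contains a value function that assigns it 0, so the pessimistic estimate steers the search away from it.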

Videos

Bi-Level Offline Policy Optimization with Limited Exploration · SlidesLive

Taxonomy

Topics: Machine Learning and Data Classification · Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research