Pessimistic Nonlinear Least-Squares Value Iteration for Offline Reinforcement Learning

Qiwei Di; Heyang Zhao; Jiafan He; Quanquan Gu

arXiv:2310.01380 · cs.LG · October 10, 2024

TL;DR

This paper introduces PNLSVI, an oracle-efficient offline RL algorithm for non-linear function approximation. Its instance-dependent regret bound is minimax optimal when specialized to linear function approximation, which extends prior results from linear to more general function classes.

Contribution

The paper proposes an oracle-efficient offline RL algorithm for non-linear function approximation, built from three components: a variance-based weighted regression scheme, a variance estimation subroutine, and pessimistic value iteration.
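To make the interplay of the three components concrete, here is a minimal sketch of one backward step in a tabular special case (one-hot features), where the weighted regression has a closed form. This is an illustration under stated assumptions, not the paper's procedure for general function classes: the function name `weighted_pessimistic_q`, the parameters `beta` and `lam`, the clipping at zero, and the variance estimates passed in as `sigma2` are all hypothetical. The paper obtains variance estimates from its own subroutine and defines its penalty for general function classes; neither is reproduced here.

```python
import math
from collections import defaultdict


def weighted_pessimistic_q(data, sigma2, beta=1.0, lam=1.0):
    """One backward step of variance-weighted, pessimistic regression.

    data   : list of (sa, target) pairs; `sa` is a hashable state-action
             key and `target` stands for r + V_next(s').
    sigma2 : per-sample variance estimates (assumed given in this sketch).
    beta   : penalty scale (hypothetical tuning parameter).
    lam    : ridge regularizer (hypothetical tuning parameter).
    Returns a dict mapping each observed sa to a pessimistic Q estimate.
    """
    weight_sum = defaultdict(float)  # sum of 1/sigma^2 per (s, a)
    target_sum = defaultdict(float)  # sum of target/sigma^2 per (s, a)
    for (sa, target), var in zip(data, sigma2):
        w = 1.0 / max(var, 1e-8)  # low-variance samples get more weight
        weight_sum[sa] += w
        target_sum[sa] += w * target

    q_pess = {}
    for sa in weight_sum:
        precision = lam + weight_sum[sa]     # diagonal weighted covariance entry
        mean = target_sum[sa] / precision    # weighted ridge estimate
        penalty = beta / math.sqrt(precision)  # shrinks as coverage grows
        # Pessimism: subtract the penalty. Clipping at 0 assumes
        # non-negative rewards, which is an assumption of this sketch.
        q_pess[sa] = max(mean - penalty, 0.0)
    return q_pess


# Toy dataset: action a0 is observed twice, action a1 only once.
data = [(("s0", "a0"), 1.0), (("s0", "a0"), 1.0), (("s0", "a1"), 1.0)]
sigma2 = [1.0, 1.0, 1.0]
q = weighted_pessimistic_q(data, sigma2, beta=1.0, lam=1.0)
```

In this toy dataset both actions have the same observed target, but the better-covered action `a0` keeps a positive value while the penalty drives `a1` down to the clipped value of zero. State-action pairs that never appear in the data are simply absent from the returned dict in this sketch.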

Findings

01 Achieves minimax optimal instance-dependent regret when specialized to linear function approximation.

02 Extends regret guarantees to general non-linear function classes.

03 Provides a versatile framework applicable to a wide range of function classes.

Abstract

Offline reinforcement learning (RL), where the agent aims to learn the optimal policy based on the data collected by a behavior policy, has attracted increasing attention in recent years. While offline RL with linear function approximation has been extensively studied with optimal results achieved under certain assumptions, many works have shifted their interest to offline RL with non-linear function approximation. However, few works on offline RL with non-linear function approximation provide instance-dependent regret guarantees. In this paper, we propose an oracle-efficient algorithm, dubbed Pessimistic Nonlinear Least-Squares Value Iteration (PNLSVI), for offline RL with non-linear function approximation. Our algorithmic design comprises three innovative components: (1) a variance-based weighted regression scheme that can be applied to a wide range of function classes, (2) a subroutine for…

Videos

Pessimistic Nonlinear Least-Squares Value Iteration for Offline Reinforcement Learning · SlidesLive

Taxonomy

Topics: Advanced Bandit Algorithms Research · COVID-19 epidemiological studies · Reinforcement Learning in Robotics