Learning Near Optimal Policies with Low Inherent Bellman Error

Andrea Zanette; Alessandro Lazaric; Mykel Kochenderfer; Emma Brunskill

arXiv:2003.00153 · cs.LG · June 30, 2020 · 38 citations

TL;DR

This paper introduces a reinforcement learning algorithm for exploration with linear action-value function approximation. Its regret bound holds under low inherent Bellman error, a condition strictly more general than the low-rank (linear) MDP assumption of prior work, and matches a lower bound up to constants and logarithmic factors.

Contribution

It shows that exploration with linear function approximation is possible under low inherent Bellman error, a more general setting than the low-rank MDP, and provides an algorithm whose regret bound is optimal up to constants and logarithmic factors.

Findings

01

The proposed algorithm achieves a high-probability regret bound that is near-optimal in the number of episodes and the feature dimension.

02

The low inherent Bellman error condition is strictly more general than the low-rank (linear) MDP assumption of prior work.

03

The regret bound is tight up to constants and logarithmic factors: a matching lower bound is proved.
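
For readers unfamiliar with the condition named in the findings, the following is a sketch of the usual definition of the inherent Bellman error for linear action-value functions. The notation is an assumption for illustration and is not taken from this page: $\phi_t$ is the feature map at timestep $t$, $\mathcal{B}_t$ the set of admissible parameters, $r_t$ and $p_t$ the reward and transition functions, and $\mathcal{T}_t$ the Bellman operator.

$$
(\mathcal{T}_t Q_{t+1})(s,a) = r_t(s,a) + \mathbb{E}_{s' \sim p_t(s,a)} \max_{a'} Q_{t+1}(s',a'),
\qquad
Q_t(\theta_t)(s,a) = \phi_t(s,a)^\top \theta_t
$$

$$
\mathcal{I} = \max_{t} \; \sup_{\theta_{t+1} \in \mathcal{B}_{t+1}} \; \inf_{\theta_t \in \mathcal{B}_t} \; \sup_{(s,a)} \left| \phi_t(s,a)^\top \theta_t - \big(\mathcal{T}_t Q_{t+1}(\theta_{t+1})\big)(s,a) \right|
$$

In words, $\mathcal{I}$ measures how far the Bellman backup of a linear action-value function can fall outside the linear class. In a low-rank MDP, where $r_t(s,a) = \phi_t(s,a)^\top \theta^r_t$ and $p_t(s' \mid s,a) = \phi_t(s,a)^\top \psi_t(s')$, the backup of any function is itself linear in $\phi_t$, so $\mathcal{I} = 0$ provided the parameter sets $\mathcal{B}_t$ are large enough to contain the backed-up parameters.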

Abstract

We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning under the notion of low inherent Bellman error, a condition normally employed to show convergence of approximate value iteration. First, we relate this condition to other common frameworks and show that it is strictly more general than the low-rank (or linear) MDP assumption of prior work. Second, we provide an algorithm with a high-probability regret bound $\widetilde{O}(\sum_{t=1}^{H} d_t \sqrt{K} + \sum_{t=1}^{H} \sqrt{d_t}\, \mathcal{I} K)$, where $H$ is the horizon, $K$ is the number of episodes, $\mathcal{I}$ is the value of the inherent Bellman error, and $d_t$ is the feature dimension at timestep $t$. In addition, we show that the result is unimprovable beyond constants and logs by showing a matching lower bound. This has two important consequences: 1) it shows that exploration is possible…
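
To get a feel for how the two terms of the regret bound trade off, here is a minimal Python sketch. It assumes the bound has the form $\sum_t d_t \sqrt{K} + \sum_t \sqrt{d_t}\, \mathcal{I} K$ as stated in the arXiv abstract, and it drops all constants and logarithmic factors. The function name `regret_bound_order` and its parameters are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def regret_bound_order(dims: Sequence[int], num_episodes: int, ibe: float) -> float:
    """Leading-order value of the regret bound, ignoring constants and logs.

    dims         -- feature dimensions d_1, ..., d_H (one per timestep)
    num_episodes -- number of episodes K
    ibe          -- inherent Bellman error (0 means no misspecification)
    """
    if num_episodes < 0 or ibe < 0 or any(d <= 0 for d in dims):
        raise ValueError("dimensions must be positive; K and ibe non-negative")

    # Learning term: grows like sqrt(K), so average regret vanishes as K grows.
    learning = sum(dims) * math.sqrt(num_episodes)

    # Misspecification term: grows linearly in K whenever ibe > 0.
    misspecification = sum(math.sqrt(d) for d in dims) * ibe * num_episodes

    return learning + misspecification


# Horizon H = 3, d_t = 4 at every timestep, K = 100 episodes, zero Bellman error.
print(regret_bound_order([4, 4, 4], 100, 0.0))  # 120.0
```

With `ibe = 0` only the $\sqrt{K}$ term remains. With any positive `ibe` the second term grows linearly in $K$ and eventually dominates, which reflects the fact that approximation error cannot be learned away.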

Videos

Learning Near Optimal Policies with Low Inherent Bellman Error · SlidesLive

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Smart Grid Energy Management