Learning Near Optimal Policies with Low Inherent Bellman Error
Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, Emma Brunskill

TL;DR
This paper introduces an algorithm for reinforcement learning that achieves near-optimal exploration efficiency under a broad condition called low inherent Bellman error, extending previous assumptions and matching theoretical lower bounds.
Contribution
It establishes a more general framework for exploration in reinforcement learning with linear function approximation and provides an algorithm whose regret bound is optimal up to constants and logarithmic factors.
Findings
The proposed algorithm achieves a high-probability regret bound that is optimal up to constants and logarithmic factors.
The low inherent Bellman error condition is strictly more general than the low-rank (or linear) MDP assumption of prior work.
The results are tight, matching lower bounds and extending prior work.
Abstract
We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning under the notion of low inherent Bellman error, a condition normally employed to show convergence of approximate value iteration. First we relate this condition to other common frameworks and show that it is strictly more general than the low rank (or linear) MDP assumption of prior work. Second we provide an algorithm with a high probability regret bound $\widetilde{O}\left(\sum_{t=1}^{H} d_t \sqrt{K} + \sum_{t=1}^{H} \sqrt{d_t}\, \mathcal{I} K\right)$ where $H$ is the horizon, $K$ is the number of episodes, $\mathcal{I}$ is the value of the inherent Bellman error and $d_t$ is the feature dimension at timestep $t$. In addition, we show that the result is unimprovable beyond constants and logs by showing a matching lower bound. This has two important consequences: 1) it shows that exploration is possible…
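To make the shape of the regret bound concrete, the following minimal sketch evaluates its two terms with constants and logarithmic factors dropped: an estimation term, sum over t of d_t times sqrt(K), and a misspecification term, sum over t of sqrt(d_t) times the inherent Bellman error times K. The function name `regret_bound_terms` and its parameters are hypothetical, introduced only for illustration; this is not the paper's algorithm, only arithmetic on the stated bound.

```python
import math


def regret_bound_terms(dims, num_episodes, ibe):
    """Evaluate the two terms of the regret bound, ignoring constants and logs.

    dims          -- feature dimensions d_1..d_H, one per timestep (hypothetical input)
    num_episodes  -- number of episodes K
    ibe           -- value of the inherent Bellman error
    Returns (estimation_term, misspecification_term).
    """
    if num_episodes < 0 or ibe < 0 or any(d <= 0 for d in dims):
        raise ValueError("expected K >= 0, inherent Bellman error >= 0, all d_t > 0")
    # First term: sum_t d_t * sqrt(K), sublinear in the number of episodes.
    estimation = sum(dims) * math.sqrt(num_episodes)
    # Second term: sum_t sqrt(d_t) * ibe * K, linear in K unless ibe is zero.
    misspecification = sum(math.sqrt(d) for d in dims) * ibe * num_episodes
    return estimation, misspecification


if __name__ == "__main__":
    # Horizon H = 2 with d_t = 4 at both timesteps (illustrative values only).
    for k in (100, 10_000):
        est, mis = regret_bound_terms([4, 4], k, ibe=0.01)
        print(f"K={k}: estimation={est:.1f}, misspecification={mis:.1f}")
```

When the inherent Bellman error is zero, the second term vanishes and the bound grows only as the square root of K; with any positive inherent Bellman error, the linear term eventually dominates as K grows.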
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Smart Grid Energy Management
