Improved Bounds for Reward-Agnostic and Reward-Free Exploration
Oran Ridel, Alon Cohen

TL;DR
This paper introduces a new algorithm for reward-free and reward-agnostic exploration in finite-horizon MDPs, relaxing the requirement on the accuracy parameter ε in the reward-agnostic setting and establishing a tight lower bound for reward-free exploration.
Contribution
It proposes a novel exploration algorithm that relaxes the requirement on the accuracy parameter ε and provides a tight lower bound for reward-free exploration.
Findings
The new algorithm significantly relaxes the requirement on the accuracy parameter ε in reward-agnostic exploration.
A tight lower bound for reward-free exploration is established, closing previous gaps.
The approach employs an online learning procedure with designed rewards for effective exploration.
Abstract
We study reward-free and reward-agnostic exploration in episodic finite-horizon Markov decision processes (MDPs), where an agent explores an unknown environment without observing external rewards. Reward-free exploration aims to enable ε-optimal policies for any reward revealed after exploration, while reward-agnostic exploration targets ε-optimality for rewards drawn from a small finite class. In the reward-agnostic setting, Li, Yan, Chen, and Fan achieve minimax sample complexity, but only for restrictively small accuracy parameter ε. We propose a new algorithm that significantly relaxes the requirement on ε. Our approach is novel and of technical interest by itself. Our algorithm employs an online learning procedure with carefully designed rewards to construct an exploration policy, which is used to gather data sufficient for accurate dynamics…
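For readers unfamiliar with the term, ε-optimality can be written out as follows. This is a sketch in standard episodic-MDP notation and is not taken from the paper: the horizon $H$, the initial state $s_1$, the value function $V_1^{\pi}(s_1; r)$, the output policy $\hat{\pi}_r$, and the reward class $\mathcal{R}$ are all assumed names.

```latex
% Value of policy \pi under reward r (assumed notation, horizon H, initial state s_1):
V_1^{\pi}(s_1; r) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{h=1}^{H} r_h(s_h, a_h) \,\middle|\, s_1\right]

% \hat{\pi}_r is \varepsilon-optimal for reward r if
\max_{\pi} V_1^{\pi}(s_1; r) \;-\; V_1^{\hat{\pi}_r}(s_1; r) \;\le\; \varepsilon

% Reward-free: the guarantee must hold for every reward function r.
% Reward-agnostic: the guarantee must hold for every r in a small finite class \mathcal{R}.
```

In both settings the data is collected before any reward is seen, so the two problems differ only in the set of rewards over which the guarantee has to hold.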
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Optimization and Search Problems
