Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon

Zihan Zhang; Xiangyang Ji; Simon S. Du

arXiv:2009.13503·cs.LG·July 1, 2021·30 cites

TL;DR

This paper introduces Monotonic Value Propagation (MVP), an algorithm for episodic reinforcement learning whose regret approaches the contextual-bandit lower bound up to logarithmic factors and depends on the planning horizon only polylogarithmically.

Contribution

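The paper presents MVP, a new algorithm whose Bernstein-type bonus is built on a monotonic value function and is tighter than existing bonus constructions. With this bonus, MVP achieves near-optimal regret in episodic RL, improving the dependence on the horizon, the number of states, and the number of actions.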

Findings

01

MVP achieves $O\big((\sqrt{SAK} + S^2A)\,\text{polylog}(SAHK)\big)$ regret.

02

MVP's regret depends on the horizon $H$ only polylogarithmically, an exponential improvement in horizon dependence over previous polynomial-time algorithms.

03

MVP's regret approaches the $\Omega(\sqrt{SAK})$ lower bound of contextual bandits up to logarithmic factors.
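
To make the shape of the bound concrete, the sketch below compares its two leading terms, $\sqrt{SAK}$ and $S^2A$, and computes the number of episodes after which the bandit-like term dominates. It is only a back-of-the-envelope illustration: constants and polylogarithmic factors are dropped, and the function names `mvp_regret_terms` and `burn_in_episodes` are hypothetical helpers, not anything defined in the paper.

```python
import math
from typing import Tuple


def mvp_regret_terms(S: int, A: int, K: int) -> Tuple[float, float]:
    """Return (sqrt(S*A*K), S^2*A), the two leading terms of the bound.

    Assumption: constants and polylog(SAHK) factors are ignored.
    """
    return math.sqrt(S * A * K), float(S * S * A)


def burn_in_episodes(S: int, A: int) -> int:
    """Smallest K with sqrt(S*A*K) >= S^2*A.

    Squaring both sides gives S*A*K >= S^4*A^2, i.e. K >= S^3*A.
    """
    return S ** 3 * A


if __name__ == "__main__":
    S, A = 10, 5
    K = burn_in_episodes(S, A)
    leading, lower_order = mvp_regret_terms(S, A, K)
    print(f"K = {K}: sqrt(SAK) = {leading:.1f}, S^2 A = {lower_order:.1f}")
```

Under these simplifications, once $K \ge S^3A$ the $\sqrt{SAK}$ term is the larger of the two, which is the regime in which the bound is comparable to the contextual-bandit rate.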

Abstract

Episodic reinforcement learning and contextual bandits are two widely studied sequential decision-making problems. Episodic reinforcement learning generalizes contextual bandits and is often perceived to be more difficult due to long planning horizon and unknown state-dependent transitions. The current paper shows that the long planning horizon and the unknown state-dependent transitions (at most) pose little additional difficulty on sample complexity. We consider the episodic reinforcement learning with $S$ states, $A$ actions, planning horizon $H$, total reward bounded by $1$, and the agent plays for $K$ episodes. We propose a new algorithm, Monotonic Value Propagation (MVP), which relies on a new Bernstein-type bonus. Compared to existing bonus constructions, the new bonus is tighter since it is based on a well-designed monotonic value function. In…

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms