Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon
Zihan Zhang, Xiangyang Ji, Simon S. Du

TL;DR
This paper introduces MVP (Monotonic Value Propagation), an algorithm for episodic reinforcement learning whose regret depends only polylogarithmically on the planning horizon and approaches the lower bound of contextual bandits.
Contribution
The paper presents MVP, an algorithm built on a new Bernstein-type bonus derived from a monotonic value function. It achieves near-optimal regret in episodic RL and improves the dependence on horizon, states, and actions.
Findings
MVP achieves $O\big((\sqrt{SAK} + S^2A)\,\text{polylog}(SAHK)\big)$ regret.
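MVP improves exponentially over previous algorithms in horizon dependence: $H$ enters the regret bound only through polylogarithmic factors rather than polynomially.
MVP's regret approaches the $\Omega(\sqrt{SAK})$ lower bound of contextual bandits up to logarithmic factors.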
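To make the two terms of the bound concrete, here is a minimal Python sketch that evaluates $\sqrt{SAK}$ and $S^2A$ with all polylogarithmic factors dropped. The function names and the example values of $S$, $A$, and $K$ are illustrative assumptions, not taken from the paper. It rests on the observation that $\sqrt{SAK} \ge S^2A$ exactly when $K \ge S^3A$, so the bandit-like term dominates once enough episodes have been played.

```python
import math
from typing import Tuple


def regret_terms(S: int, A: int, K: int) -> Tuple[float, float]:
    """Return (sqrt(SAK), S^2 A), the two terms of the bound.

    Polylog(SAHK) factors and constants are deliberately ignored.
    """
    leading = math.sqrt(S * A * K)
    lower_order = float(S * S * A)
    return leading, lower_order


def dominance_threshold(S: int, A: int) -> int:
    """Smallest K with sqrt(SAK) >= S^2 A, which is K = S^3 A."""
    return S ** 3 * A


if __name__ == "__main__":
    S, A = 10, 5  # hypothetical problem size
    for K in (100, dominance_threshold(S, A), 1_000_000):
        leading, lower_order = regret_terms(S, A, K)
        print(f"K={K}: sqrt(SAK)={leading:.1f}, S^2A={lower_order:.1f}")
```

Note that the horizon $H$ does not appear in either term, which is the sense in which the bound escapes the curse of horizon.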
Abstract
Episodic reinforcement learning and contextual bandits are two widely studied sequential decision-making problems. Episodic reinforcement learning generalizes contextual bandits and is often perceived to be more difficult due to the long planning horizon and unknown state-dependent transitions. The current paper shows that the long planning horizon and the unknown state-dependent transitions (at most) pose little additional difficulty on sample complexity. We consider episodic reinforcement learning with $S$ states, $A$ actions, planning horizon $H$, and total reward bounded by $1$, where the agent plays for $K$ episodes. We propose a new algorithm, Monotonic Value Propagation (MVP), which relies on a new Bernstein-type bonus. Compared to existing bonus constructions, the new bonus is tighter since it is based on a well-designed monotonic value function. In…
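The abstract does not spell out the bonus, so the following is only a generic sketch of what a Bernstein-type (variance-aware) exploration bonus looks like, compared with a Hoeffding-type bonus. It is not MVP's actual construction: the constants `c1`, `c2`, `c`, the log term `iota`, and all function names are hypothetical placeholders, and the paper's bonus uses its own carefully chosen constants to guarantee optimism and monotonicity.

```python
import math
from typing import Sequence


def empirical_variance(p_hat: Sequence[float], v: Sequence[float]) -> float:
    """Variance of next-state values v under the empirical transition p_hat."""
    mean = sum(p * x for p, x in zip(p_hat, v))
    return sum(p * (x - mean) ** 2 for p, x in zip(p_hat, v))


def bernstein_bonus(p_hat: Sequence[float], v: Sequence[float], n: int,
                    iota: float, c1: float = 2.0, c2: float = 1.0) -> float:
    """Generic Bernstein-type bonus: a variance term plus a 1/n term.

    c1 and c2 are illustrative constants, not those used by MVP.
    """
    n = max(n, 1)  # guard against unvisited state-action pairs
    var = empirical_variance(p_hat, v)
    return c1 * math.sqrt(var * iota / n) + c2 * iota / n


def hoeffding_bonus(value_range: float, n: int, iota: float,
                    c: float = 1.0) -> float:
    """Generic Hoeffding-type bonus that scales with the full value range."""
    n = max(n, 1)
    return c * value_range * math.sqrt(iota / n)


if __name__ == "__main__":
    p_hat = [0.5, 0.5]   # hypothetical empirical transition probabilities
    v = [0.4, 0.6]       # hypothetical next-state value estimates
    print(bernstein_bonus(p_hat, v, n=100, iota=1.0))
    print(hoeffding_bonus(1.0, n=100, iota=1.0))
```

When the next-state values vary little, the variance term shrinks and the Bernstein-type bonus falls well below the Hoeffding-type one, which is the general reason variance-aware bonuses are described as tighter.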
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
