The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret and Policy Switches
Grigoris Velegkas, Zhuoran Yang, Amin Karbasi

TL;DR
This paper presents algorithms for episodic reinforcement learning that achieve logarithmic regret and minimal policy switches in the instance-dependent setting, supported by matching lower bounds.
Contribution
It introduces RL algorithms with logarithmic regret and low switch complexity in the instance-dependent setting, extending to general function and model classes.
Findings
Regret scales logarithmically with the horizon T, provided there is a gap between the best and second-best action in every state.
Logarithmic regret is achievable by algorithms that make only O(log T) policy switches.
Lower bounds show that no algorithm can achieve o(log T) regret.
Abstract
In this paper, we study the problem of regret minimization for episodic Reinforcement Learning (RL) both in the model-free and the model-based setting. We focus on learning with general function classes and general model classes, and we derive results that scale with the eluder dimension of these classes. In contrast to the existing body of work that mainly establishes instance-independent regret guarantees, we focus on the instance-dependent setting and show that the regret scales logarithmically with the horizon T, provided that there is a gap between the best and the second best action in every state. In addition, we show that such a logarithmic regret bound is realizable by algorithms with O(log T) switching cost (also known as adaptivity complexity). In other words, these algorithms rarely switch their policy during the course of their execution. Finally, we complement our…
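To make the notion of a logarithmic switching cost concrete, the following minimal sketch counts policy switches under a simple doubling schedule, where the policy is recomputed only at episodes 1, 2, 4, 8, and so on. This is an illustrative assumption, not the switching criterion used in the paper; the function names are hypothetical.

```python
from typing import List


def doubling_update_episodes(num_episodes: int) -> List[int]:
    """Return the 1-indexed episodes at which a doubling schedule recomputes the policy.

    Assumption for illustration only: updates happen at episodes 1, 2, 4, 8, ...
    """
    episodes = []
    k = 1
    while k <= num_episodes:
        episodes.append(k)
        k *= 2
    return episodes


def count_switches(num_episodes: int) -> int:
    """Upper bound on policy switches: one per update after the initial policy."""
    return max(len(doubling_update_episodes(num_episodes)) - 1, 0)


if __name__ == "__main__":
    for T in (10, 1_000, 1_000_000):
        print(T, count_switches(T))
```

Under this schedule the number of switches over T episodes is at most floor(log2 T), so it grows logarithmically rather than linearly in T.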
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Smart Grid Energy Management
