A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation
Heyang Zhao, Jiafan He, Quanquan Gu

TL;DR
This paper introduces MQL-UCB, a reinforcement learning algorithm that achieves near-optimal regret and low policy switching costs using a novel deterministic switching strategy, monotonic value functions, and variance-weighted regression.
Contribution
The paper presents a new RL algorithm with a deterministic policy-switching strategy, monotonic value functions, and variance-weighted regression, achieving minimax optimal regret and low switching costs.
Findings
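Achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when the number of episodes $K$ is sufficiently large
Near-optimal policy switching cost of $\tilde{O}(dH)$
Applies to general function approximation, with provable guarantees on both regret and switching cost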
Abstract
The exploration-exploitation dilemma has been a central challenge in reinforcement learning (RL) with complex model classes. In this paper, we propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for RL with general function approximation. Our key algorithmic design includes (1) a general deterministic policy-switching strategy that achieves low switching cost, (2) a monotonic value function structure with carefully controlled function class complexity, and (3) a variance-weighted regression scheme that exploits historical trajectories with high data efficiency. MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and near-optimal policy switching cost of $\tilde{O}(dH)$, with $d$ being the eluder dimension of the function class, $H$ being the planning horizon, and $K$ being the number of episodes. Our work…
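To make the variance-weighted regression idea in item (3) concrete, here is a minimal sketch for the linear special case only. It is not the paper's algorithm, which works with a general function class; the function name `variance_weighted_ridge`, the ridge parameter `lam`, and the assumption that per-sample variance estimates `sigma2` are already available are all hypothetical choices made for illustration.

```python
import numpy as np


def variance_weighted_ridge(X, y, sigma2, lam=1.0):
    """Weighted ridge regression (illustrative linear case, not the paper's method).

    Minimizes  sum_i (x_i^T theta - y_i)^2 / sigma2_i + lam * ||theta||^2,
    so samples with smaller estimated variance get larger weight.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(sigma2, dtype=float)  # inverse-variance weights
    A = (X * w[:, None]).T @ X + lam * np.eye(X.shape[1])  # weighted Gram matrix
    b = X.T @ (w * y)  # weighted targets
    return np.linalg.solve(A, b)


# Usage: half of the samples are low-noise, half are high-noise.
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0])
X = rng.normal(size=(200, 2))
sigma2 = np.where(np.arange(200) % 2 == 0, 0.01, 4.0)  # assumed known here
y = X @ theta_true + rng.normal(size=200) * np.sqrt(sigma2)
theta_hat = variance_weighted_ridge(X, y, sigma2, lam=1e-3)
```

Down-weighting high-variance samples lets the low-noise samples dominate the estimate, which is the intuition behind the data-efficiency claim. In practice the variances are unknown and must be estimated, a step this sketch leaves out.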
Peer Reviews
Decision: NeurIPS 2024 poster
The paper is very well-written, and the main results of the paper are of high quality. The authors improved prior works on RL with general function approximation to get an algorithm that achieves both the near-optimal regret and the lowest possible switching cost when the number of episodes is large. Besides, the proposed algorithm is intuitive and clean, and the proofs are well-written and easy to follow.
It would be helpful to provide a bit more details for readers (like me) who are not very familiar with the literature on RL with policy switching cost on how to obtain the counting of switches in Table 1. For example, [1] doesn't seem to optimize the number of policy switches (while only trying to optimize the regret); since the paper is directly improving upon [1], it would be helpful to provide a more detailed discussion on why [1]'s algorithm needs $\tilde{O}(dim(\mathcal{F})^2H)$ number of s
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Optimization and Search Problems
Methods: Q-Learning
