A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation

Heyang Zhao; Jiafan He; Quanquan Gu

arXiv:2311.15238·cs.LG·October 6, 2025·1 cites

TL;DR

This paper introduces MQL-UCB, a reinforcement learning algorithm that achieves near-optimal regret and low policy switching costs using a novel deterministic switching strategy, monotonic value functions, and variance-weighted regression.

Contribution

The paper presents a new RL algorithm with a deterministic policy-switching strategy, monotonic value functions, and variance-weighted regression, achieving minimax optimal regret and low switching costs.

Findings

01

Achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$

02

Near-optimal policy switching cost of $\tilde{O}(dH)$

03

Effective for general function approximation with provable guarantees

Abstract

The exploration-exploitation dilemma has been a central challenge in reinforcement learning (RL) with complex model classes. In this paper, we propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for RL with general function approximation. Our key algorithmic design includes (1) a general deterministic policy-switching strategy that achieves low switching cost, (2) a monotonic value function structure with carefully controlled function class complexity, and (3) a variance-weighted regression scheme that exploits historical trajectories with high data efficiency. MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and near-optimal policy switching cost of $\tilde{O}(dH)$, with $d$ being the eluder dimension of the function class, $H$ being the planning horizon, and $K$ being the number of episodes. Our work…
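To make two of the design ideas in the abstract concrete, here is a minimal sketch restricted to the linear special case. This is an assumption made purely for illustration: the paper itself works with general function classes, and none of the names below (`weighted_ridge`, `should_switch`, `count_switches`, the `ratio` threshold) come from it. The first function shows variance-weighted regression, where each sample is down-weighted by its estimated variance. The second shows a standard determinant-doubling rule from the linear bandit literature, swapped in here as a stand-in for a deterministic low-switching criterion: the policy is refreshed only when the accumulated information has grown by a constant factor.

```python
import numpy as np


def weighted_ridge(X, y, sigma2, lam=1.0):
    """Variance-weighted ridge regression (linear case, illustrative only).

    Minimizes sum_i (x_i^T theta - y_i)^2 / sigma2_i + lam * ||theta||^2.
    Returns the estimate and the weighted covariance matrix.
    """
    w = 1.0 / np.asarray(sigma2)           # low-variance samples count more
    Xw = X * w[:, None]                    # each row scaled by its weight
    A = lam * np.eye(X.shape[1]) + Xw.T @ X
    b = Xw.T @ y
    return np.linalg.solve(A, b), A


def should_switch(A_current, A_last, ratio=2.0):
    """Hypothetical determinant-doubling trigger.

    Returns True once det(A_current) exceeds ratio * det(A_last).
    Log-determinants are used for numerical stability.
    """
    _, logdet_cur = np.linalg.slogdet(A_current)
    _, logdet_old = np.linalg.slogdet(A_last)
    return logdet_cur - logdet_old > np.log(ratio)


def count_switches(features, variances, lam=1.0, ratio=2.0):
    """Counts how often the trigger fires over a stream of samples."""
    d = features.shape[1]
    A = lam * np.eye(d)
    A_last = A.copy()                      # covariance at the last policy update
    switches = 0
    for x, s2 in zip(features, variances):
        A = A + np.outer(x, x) / s2        # variance-weighted rank-one update
        if should_switch(A, A_last, ratio):
            A_last = A.copy()
            switches += 1
    return switches


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(2000, 3))
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    print("switches over 2000 samples:", count_switches(feats, np.ones(2000)))
```

Under this rule each switch at least doubles the determinant, so the number of switches is bounded by the base-2 logarithm of the final determinant ratio. In the linear case that grows like $d \log K$ rather than linearly in the number of samples, which is the intuition behind switching costs that scale with dimension and horizon instead of the number of episodes.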

Peer Reviews

Decision·NeurIPS 2024 poster

Reviewer 01·Rating 8·Confidence 3

Strengths

The paper is very well-written, and the main results of the paper are of high quality. The authors improved prior works on RL with general function approximation to get an algorithm that achieves both the near-optimal regret and the lowest possible switching cost when the number of episodes is large. Besides, the proposed algorithm is intuitive and clean, and the proofs are well-written and easy to follow.

Weaknesses

It would be helpful to provide a bit more details for readers (like me) who are not very familiar with the literature on RL with policy switching cost on how to obtain the counting of switches in Table 1. For example, [1] doesn't seem to optimize the number of policy switches (while only trying to optimize the regret); since the paper is directly improving upon [1], it would be helpful to provide a more detailed discussion on why [1]'s algorithm needs $\tilde{O}(dim(\mathcal{F})^2H)$ number of s

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Optimization and Search Problems

Methods: Q-Learning