A Bit of Freedom Goes a Long Way: Classical and Quantum Algorithms for Reinforcement Learning under a Generative Model

Andris Ambainis; Joao F. Doriguello; Debbie Lim

arXiv:2507.22854·cs.LG·August 12, 2025

TL;DR

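This paper introduces classical and quantum online algorithms for reinforcement learning in finite-horizon and infinite-horizon average-reward MDPs. The agent is given occasional access to a generative model (a "simulator"), which yields better regret bounds than previous works, including regret that depends only logarithmically on T for the quantum algorithms in the finite-horizon case.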

Contribution

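The paper presents novel classical and quantum algorithms for RL in MDPs that avoid paradigms such as "optimism in the face of uncertainty" and "posterior sampling", and instead compute and use optimal policies directly. The resulting regret bounds improve on previous works, with the quantum algorithms achieving logarithmic dependence on T in the finite-horizon case.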

Findings

01

Quantum algorithms achieve logarithmic regret dependence on T for finite-horizon MDPs.

02

Both the classical and the quantum algorithms improve on previous regret bounds for infinite-horizon average-reward MDPs.

03

Quantum algorithms outperform classical ones in terms of parameter dependence and regret.
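For readers unfamiliar with the quantity these findings refer to, the block below recalls the textbook definitions of regret in the two settings. The notation (K episodes of horizon H with T = KH, value functions V, policies \pi_k, optimal gain \rho^*) is an assumption made for illustration and is not taken from the paper, whose hybrid model may count regret differently, for example only during exploration phases.

```latex
% Finite-horizon (episodic) regret over K episodes of length H, with T = KH:
\mathrm{Regret}(T) \;=\; \sum_{k=1}^{K} \Bigl( V_1^{*}(s_1^{k}) - V_1^{\pi_k}(s_1^{k}) \Bigr)

% Infinite-horizon average-reward regret over T steps,
% where \rho^{*} is the optimal long-run average reward:
\mathrm{Regret}(T) \;=\; T\,\rho^{*} \;-\; \sum_{t=1}^{T} r_t
```

A regret bound that is logarithmic in T means the total loss against the optimal policy grows far more slowly than the \sqrt{T} rate typical of classical online algorithms.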

Abstract

We propose novel classical and quantum online algorithms for learning finite-horizon and infinite-horizon average-reward Markov Decision Processes (MDPs). Our algorithms are based on a hybrid exploration-generative reinforcement learning (RL) model wherein the agent can, from time to time, freely interact with the environment in a generative sampling fashion, i.e., by having access to a "simulator". By employing known classical and new quantum algorithms for approximating optimal policies under a generative model within our learning algorithms, we show that it is possible to avoid several paradigms from RL like "optimism in the face of uncertainty" and "posterior sampling" and instead compute and use optimal policies directly, which yields better regret bounds compared to previous works. For finite-horizon MDPs, our quantum algorithms obtain regret bounds which only depend…
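The generative step described in the abstract can be illustrated with a minimal classical sketch, assuming a tabular finite-horizon MDP with known rewards. It queries a simulator at every state-action pair, builds an empirical transition model, and computes a policy by backward induction on that model. All names here (`sample_model`, `backward_induction`, `simulator`, the toy two-state MDP) are hypothetical and chosen for illustration; this is not the paper's algorithm, and the quantum subroutines are not represented.

```python
import numpy as np


def sample_model(simulator, n_states, n_actions, n_samples, rng):
    """Estimate transition probabilities by querying a generative model.

    `simulator(s, a, rng)` is a hypothetical callable returning a next state.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                counts[s, a, simulator(s, a, rng)] += 1.0
    return counts / n_samples


def backward_induction(p_hat, rewards, horizon):
    """Plan on the estimated model; returns per-step policy and initial values."""
    n_states = p_hat.shape[0]
    value = np.zeros(n_states)
    policy = np.zeros((horizon, n_states), dtype=int)
    for h in reversed(range(horizon)):
        q = rewards + p_hat @ value  # shape (n_states, n_actions)
        policy[h] = q.argmax(axis=1)
        value = q.max(axis=1)
    return policy, value


# Toy deterministic MDP (an assumption for illustration):
# action 0 stays in the current state, action 1 switches state.
true_p = np.zeros((2, 2, 2))
true_p[0, 0, 0] = true_p[1, 0, 1] = 1.0
true_p[0, 1, 1] = true_p[1, 1, 0] = 1.0
# Reward 1 only for staying in state 1.
rewards = np.array([[0.0, 0.0], [1.0, 0.0]])


def simulator(s, a, rng):
    """Generative model: draw one next state from the true dynamics."""
    return int(rng.choice(2, p=true_p[s, a]))


rng = np.random.default_rng(0)
p_hat = sample_model(simulator, n_states=2, n_actions=2, n_samples=20, rng=rng)
policy, value = backward_induction(p_hat, rewards, horizon=3)
print(policy[0], value)
```

In a hybrid scheme of the kind the abstract describes, a policy computed this way would then be followed during the subsequent exploration phase, rather than a policy derived from optimism bonuses or posterior samples.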

Peer Reviews

Decision: Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The application of quantum computing (e.g., the quantum mean estimation subroutine) to online exploration of RL is an interesting and important problem. This paper pushes the boundary of this problem.

2. As far as I know, this is the first paper investigating quantum RL in the setting of infinite-horizon average-reward MDP. It shows that the sample complexity can also be improved quadratically with the help of quantum mean estimation.

3. The idea of applying quantum maximum finding subrouti

Weaknesses

1. There is a major problem in the formulation of this "exploration-generation" two-phase procedure, which assumes that the agent can use the oracles as *a generative model* in the generation phase *without incurring any regret*. Many works in the classical RL literature use this idea of "lazy update" to design sample-efficient algorithms, such as [1, 2], but none of them assumes access to a generative model or that the data collection phase incurs no regret. Th

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The paper is well-written and engaging. The discussed related work is extensive and provides a good overview on similar research.

Weaknesses

The applicability of the proposed algorithms is strictly confined to RL scenarios where a generative model with comprehensive knowledge of the state and action spaces, as well as the reward function, is available. Moreover, it assumes the transition probabilities can be queried through an oracle. Such oracles are not novel in RL research and are frequently associated with methods categorized as model-based RL. Past research has already explored quantum oracles extensively, particularly fo

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The paper provides solid theoretical proofs for its results.

2. For the online learning problem of MDPs, it is difficult to define a proper notion of regret once quantum algorithms are introduced. The authors introduce a novel model with classical exploration phases and classical/quantum generative phases to resolve this difficulty.

Weaknesses

1. In Section 2 (computing optimal policies), the authors consider the undiscounted version of MDPs (both finite-horizon and infinite-horizon). In my understanding, the discounted version is more important, and a quantum algorithm for it has already been proposed. The motivation and technical challenges of the undiscounted version are not clearly explained.

2. In Section 3 (the online learning version), the authors introduce a novel model which splits the interaction into two types of phases

Taxonomy

Topics: Quantum Computing Algorithms and Architecture · Advanced Bandit Algorithms Research · Quantum many-body systems