Regret of exploratory policy improvement and $q$-learning

Wenpin Tang; Xun Yu Zhou

arXiv:2411.01302 · cs.LG · November 5, 2024


TL;DR

This paper analyzes the convergence, error bounds, and regret of the $q$-learning and exploratory policy improvement algorithms for controlled diffusion processes, under suitable growth and regularity conditions on the model parameters.

Contribution

It offers the first quantitative error and regret analysis of the exploratory policy improvement and $q$-learning algorithms for controlled diffusion processes, extending the work of Jia and Zhou (2023), who introduced these algorithms.

Findings

1. Provides explicit error bounds for $q$-learning.

2. Quantifies regret for exploratory policy improvement.

3. Establishes convergence conditions for the algorithms.

Abstract

We study the convergence of $q$-learning and related algorithms introduced by Jia and Zhou (J. Mach. Learn. Res., 24 (2023), 161) for controlled diffusion processes. Under suitable conditions on the growth and regularity of the model parameters, we provide a quantitative error and regret analysis of both the exploratory policy improvement algorithm and the $q$-learning algorithm.
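
For orientation, the policy update behind exploratory policy improvement can be sketched in a few lines. In the entropy-regularized framework of Jia and Zhou (2023), the improved policy is a Gibbs distribution whose density over actions is proportional to $\exp(q/\gamma)$, with $\gamma > 0$ the temperature. The snippet below is only an illustration under assumptions that are not taken from the paper: the action space is replaced by a finite grid, the $q$-values are a made-up quadratic, and the names `gibbs_policy`, `q_values`, and `temperature` are hypothetical.

```python
import math


def gibbs_policy(q_values, temperature):
    """Probabilities proportional to exp(q / temperature) on a finite action grid.

    Assumption: a discretized action space stands in for the continuous one.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical q-values on a five-point action grid, peaked at a = 0.5.
actions = [-1.0, -0.5, 0.0, 0.5, 1.0]
q_values = [-(a - 0.5) ** 2 for a in actions]

policy = gibbs_policy(q_values, temperature=0.1)
for a, p in zip(actions, policy):
    print(f"a = {a:+.1f}  pi(a) = {p:.4f}")
```

Lowering `temperature` concentrates the policy on the action with the largest $q$-value, while raising it spreads probability mass across the grid, which corresponds to more exploration.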


Taxonomy

Topics: Bayesian Modeling and Causal Inference

Methods: Diffusion