Regret of exploratory policy improvement and $q$-learning
Wenpin Tang, Xun Yu Zhou

TL;DR
This paper analyzes the convergence, error bounds, and regret of $q$-learning and exploratory policy improvement algorithms for controlled diffusion processes, under suitable growth and regularity conditions on the model parameters.
Contribution
It offers the first quantitative error and regret analysis for these algorithms in the context of controlled diffusion processes, extending the work of Jia and Zhou (2023) in which the algorithms were introduced.
Findings
Provides explicit error bounds for $q$-learning.
Quantifies regret for exploratory policy improvement.
Establishes convergence conditions for the algorithms.
Abstract
We study the convergence of $q$-learning and related algorithms introduced by Jia and Zhou (J. Mach. Learn. Res., 24 (2023), 161) for controlled diffusion processes. Under suitable conditions on the growth and regularity of the model parameters, we provide a quantitative error and regret analysis of both the exploratory policy improvement algorithm and the $q$-learning algorithm.
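For orientation only: in the entropy-regularized setting of Jia and Zhou, the policy improvement step is commonly described as replacing the current policy with a Gibbs measure proportional to $\exp(q/\gamma)$, where $\gamma > 0$ is the temperature. The sketch below illustrates that single step on a finite action grid. The function name `gibbs_policy`, the grid, and the numerical values are hypothetical assumptions made for illustration; this is not the algorithm analyzed in the paper, which works with continuous action spaces and controlled diffusions.

```python
import math

def gibbs_policy(q_values, temperature):
    """Return probabilities proportional to exp(q / temperature) on a finite action grid.

    Assumption: the action space has been discretized, so q_values is a list
    holding one (hypothetical) q-value per grid point at a fixed state.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative numbers only: q-values on a three-point action grid.
probs = gibbs_policy([-1.0, 0.0, -0.5], temperature=0.5)
```

A smaller temperature concentrates the probabilities on the action with the largest q-value, while a larger temperature spreads them more evenly, which is the exploration trade-off the temperature controls.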
Taxonomy
Topics: Bayesian Modeling and Causal Inference
Methods: Diffusion
