Continuous-time reinforcement learning: ellipticity enables model-free value function approximation
Wenlong Mou

TL;DR
This paper introduces a model-free, off-policy reinforcement learning method for continuous-time Markov diffusions that leverages ellipticity to enable effective value function approximation without restrictive assumptions.
Contribution
It establishes new Hilbert-space properties for Bellman operators and proposes the Sobolev-prox fitted q-learning algorithm with theoretical error bounds.
Findings
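Ellipticity of the diffusion yields Hilbert-space positive definiteness and boundedness properties for the Bellman operators.
The Sobolev-prox fitted q-learning algorithm learns value and advantage functions by iteratively solving least-squares regression problems.
Oracle inequalities bound the estimation error in terms of the function classes' best approximation error, their localized complexity, an exponentially decaying optimization error, and discretization.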
Abstract
We study off-policy reinforcement learning for controlling continuous-time Markov diffusion processes with discrete-time observations and actions. We consider model-free algorithms with function approximation that learn value and advantage functions directly from data, without unrealistic structural assumptions on the dynamics. Leveraging the ellipticity of the diffusions, we establish a new class of Hilbert-space positive definiteness and boundedness properties for the Bellman operators. Based on these properties, we propose the Sobolev-prox fitted q-learning algorithm, which learns value and advantage functions by iteratively solving least-squares regression problems. We derive oracle inequalities for the estimation error, governed by (i) the best approximation error of the function classes, (ii) their localized complexity, (iii) exponentially decaying optimization error, and (iv)…
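To make "iteratively solving least-squares regression problems" concrete, here is a minimal sketch of generic fitted Q-iteration on off-policy data from a discretized diffusion. It is not the paper's Sobolev-prox algorithm: there is no Sobolev-norm proximal term and no separate advantage function. The toy 1D dynamics, the quadratic features, the reward, and every name and parameter (`simulate_transitions`, `features`, `fitted_q_iteration`, `greedy_action`, `beta`, `ridge`) are assumptions made for illustration only.

```python
import numpy as np

def simulate_transitions(n, dt=0.05, sigma=0.5, seed=0):
    """Off-policy data from a toy 1D diffusion dX = a dt + sigma dW (assumed model).

    States are drawn uniformly, actions uniformly at random (the behavior policy),
    and one Euler-Maruyama step of length dt gives the next observed state.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-2.0, 2.0, size=n)
    a = rng.integers(0, 2, size=n)               # action index: 0 or 1
    drift = np.where(a == 0, -1.0, 1.0)          # action 0 pushes left, 1 pushes right
    x_next = x + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(n)
    r = -(x ** 2) * dt                           # running cost accumulated over one step
    return x, a, r, x_next

def features(x):
    """Quadratic polynomial features [1, x, x^2] (an illustrative function class)."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x, x ** 2], axis=-1)

def fitted_q_iteration(x, a, r, x_next, n_actions=2, beta=1.0, dt=0.05,
                       n_iters=200, ridge=1e-6):
    """Generic fitted Q-iteration: one ridge least-squares fit per action per round."""
    gamma = np.exp(-beta * dt)                   # per-step discount from discount rate beta
    phi, phi_next = features(x), features(x_next)
    d = phi.shape[1]
    w = np.zeros((n_actions, d))
    for _ in range(n_iters):
        q_next = phi_next @ w.T                  # shape (n, n_actions)
        target = r + gamma * q_next.max(axis=1)  # Bellman optimality target
        w_new = np.empty_like(w)
        for k in range(n_actions):
            m = a == k                           # samples where action k was taken
            gram = phi[m].T @ phi[m] + ridge * np.eye(d)
            w_new[k] = np.linalg.solve(gram, phi[m].T @ target[m])
        w = w_new
    return w

def greedy_action(w, x):
    """Greedy action index at a scalar state x under the fitted Q-function."""
    return int(np.argmax(features(x) @ w.T, axis=-1))

if __name__ == "__main__":
    x, a, r, x_next = simulate_transitions(20000)
    w = fitted_q_iteration(x, a, r, x_next)
    print("weights per action:\n", w)
    print("greedy action at x = 1.5:", greedy_action(w, 1.5))
```

In this toy setup the two actions' Q-values differ only at order `dt`, so the greedy action is decided by a small gap between two regression fits. That small gap illustrates why learning an advantage function alongside the value function, as the abstract describes, is natural when observations are closely spaced in time.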
