Global Convergence of the ODE Limit for Online Actor-Critic Algorithms in Reinforcement Learning
Ziheng Wang, Justin Sirignano

TL;DR
This paper proves that, under a time rescaling, the online actor-critic algorithm with tabular parametrization converges to an ODE limit as the number of updates becomes large. It then shows that the critic converges to the solution of the Bellman equation and the actor to the optimal policy, with convergence rates under specific conditions on the learning rates and exploration.
Contribution
The paper gives a rigorous proof that online actor-critic algorithms converge to an ODE limit, together with convergence rates and the conditions on learning rates and exploration under which they hold.
Findings
Convergence of the actor-critic algorithm to an ODE limit as the number of updates becomes large.
Proven convergence of the critic to the solution of the Bellman equation and of the actor to the optimal policy.
Established convergence rates and conditions for learning rates and exploration.
Abstract
Actor-critic algorithms are widely used in reinforcement learning, but are challenging to mathematically analyse due to the online arrival of non-i.i.d. data samples. The distribution of the data samples dynamically changes as the model is updated, introducing a complex feedback loop between the data distribution and the reinforcement learning algorithm. We prove that, under a time rescaling, the online actor-critic algorithm with tabular parametrization converges to an ordinary differential equation (ODE) as the number of updates becomes large. The proof first establishes the geometric ergodicity of the data samples under a fixed actor policy. Then, using a Poisson equation, we prove that the fluctuations of the data samples around a dynamic probability measure, which is a function of the evolving actor model, vanish as the number of updates becomes large. Once the ODE limit has been…
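To make the setting concrete, here is a minimal sketch of a tabular online actor-critic loop with a time rescaling. It is a generic textbook variant, not the paper's exact update rule: the softmax actor, the SARSA-style critic, the step size `1 / n_scale`, and the names `run_actor_critic`, `n_scale`, and `horizon` are all assumptions made for illustration. The point it shows is the feedback loop the abstract describes: each sample `(s, a, s_next, a_next)` is drawn from a Markov chain whose law depends on the current actor table.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    weights = np.exp(shifted)
    return weights / weights.sum()


def run_actor_critic(P, R, gamma=0.9, n_scale=1000, horizon=10.0, seed=0):
    """Run a tabular online actor-critic loop (illustrative sketch only).

    P: array of shape (S, A, S) with transition probabilities.
    R: array of shape (S, A) with rewards.
    n_scale, horizon: hypothetical names for the rescaling; the step size is
    1 / n_scale and the loop runs for horizon * n_scale updates.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    theta = np.zeros((n_states, n_actions))  # actor: softmax logits per state
    Q = np.zeros((n_states, n_actions))      # critic: tabular action values
    alpha = 1.0 / n_scale                    # assumed step size
    n_steps = int(horizon * n_scale)

    s = 0
    a = rng.choice(n_actions, p=softmax(theta[s]))
    for _ in range(n_steps):
        # The next sample depends on the current actor, so data are non-i.i.d.
        s_next = rng.choice(n_states, p=P[s, a])
        a_next = rng.choice(n_actions, p=softmax(theta[s_next]))

        # Temporal-difference error for the critic.
        td_error = R[s, a] + gamma * Q[s_next, a_next] - Q[s, a]

        # Gradient of log softmax policy w.r.t. the logits of state s.
        grad_log_pi = -softmax(theta[s])
        grad_log_pi[a] += 1.0

        # Actor step uses the critic value from before this critic update.
        theta[s] += alpha * Q[s, a] * grad_log_pi
        Q[s, a] += alpha * td_error

        s, a = s_next, a_next

    policy = np.array([softmax(row) for row in theta])
    return policy, Q


if __name__ == "__main__":
    # Toy 2-state, 2-action MDP, invented purely for this example.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
    R = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    policy, Q = run_actor_critic(P, R, n_scale=200, horizon=5.0)
    print(policy)
    print(Q)
```

Increasing `n_scale` while holding `horizon` fixed shrinks each step and adds proportionally more updates, which is the regime in which an ODE limit of this kind is studied.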
Taxonomy
Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Advanced Bandit Algorithms Research
