
TL;DR
This paper presents a class of variational actor-critic algorithms that optimize the value function and the policy jointly. It proposes two variants, the clipping method and the flipping method, to speed up convergence, and shows that the algorithm's fixed point is close to the optimal policy when the prefactor of the Bellman residual is sufficiently large.
Contribution
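It introduces a variational formulation of actor-critic over both the value function and the policy, proposes two variants of the gradient descent scheme (the clipping method and the flipping method), and proves that the fixed point of the algorithm is close to the optimal policy when the prefactor of the Bellman residual is sufficiently large.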
Findings
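The clipping and flipping variants are proposed to speed up convergence relative to vanilla gradient descent
The fixed point of the algorithm is close to the optimal policy when the prefactor of the Bellman residual is sufficiently large
The objective combines value maximization and Bellman-residual minimization in a single variational formulation over the value function and the policy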
Abstract
We introduce a class of variational actor-critic algorithms based on a variational formulation over both the value function and the policy. The objective function of the variational formulation consists of two parts: one for maximizing the value function and the other for minimizing the Bellman residual. Besides the vanilla gradient descent with both the value function and the policy updates, we propose two variants, the clipping method and the flipping method, in order to speed up the convergence. We also prove that, when the prefactor of the Bellman residual is sufficiently large, the fixed point of the algorithm is close to the optimal policy.
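The two-part objective described in the abstract can be sketched on a small tabular problem. The abstract does not give the exact functional, so the form below is an assumption: a weighted value term, minus rho . V, plus (lam / 2) times the squared Bellman residual, written as a loss to minimize. The names `rho` (state weighting), `lam` (prefactor), and the 2-state MDP are hypothetical, chosen only for illustration. The clipping and flipping variants are not defined in the abstract and are not shown.

```python
import numpy as np

def bellman_residual(V, pi, P, R, gamma):
    """Residual V - T^pi V for a tabular MDP.

    P: (S, A, S) transition probabilities, R: (S, A) rewards,
    pi: (S, A) policy with rows summing to 1, V: (S,) value estimates.
    """
    Q = R + gamma * (P @ V)               # (S, A) one-step lookahead
    return V - np.sum(pi * Q, axis=1)     # (S,)

def variational_objective(V, pi, P, R, gamma, rho, lam):
    """Assumed loss (to minimize): -rho . V + (lam / 2) * ||V - T^pi V||^2."""
    res = bellman_residual(V, pi, P, R, gamma)
    return -rho @ V + 0.5 * lam * np.sum(res ** 2)

# Hypothetical 2-state, 2-action MDP, for illustration only.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)                 # uniform policy
rho = np.array([0.5, 0.5])                # assumed state weighting
gamma, lam = 0.9, 10.0                    # discount and assumed prefactor

# Exact value of pi: solve (I - gamma * P_pi) V = r_pi.
P_pi = np.einsum("sa,sat->st", pi, P)
r_pi = np.sum(pi * R, axis=1)
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# At V_pi the residual vanishes, so only the value term remains.
loss_exact = variational_objective(V_pi, pi, P, R, gamma, rho, lam)
# Inflating V lowers the value term but incurs a residual penalty.
loss_shifted = variational_objective(V_pi + 1.0, pi, P, R, gamma, rho, lam)
```

In this sketch, shifting V by a constant trades a lower value term for a nonzero residual penalty that scales with `lam`. A larger prefactor makes the penalty dominate, which is consistent with the abstract's condition that the prefactor be sufficiently large.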
