An Approximate Policy Iteration Viewpoint of Actor-Critic Algorithms

Zaiwei Chen; Siva Theja Maguluri

arXiv:2208.03247 · cs.LG · January 16, 2023 · 1 citation

TL;DR

This paper analyzes actor-critic algorithms in reinforcement learning, showing geometric convergence of natural policy gradient and providing finite-sample guarantees for the critic, leading to an overall sample complexity bound.

Contribution

It offers a novel perspective by viewing natural policy gradient as approximate policy iteration and establishes the first overall sample complexity for policy-based methods with off-policy sampling.

Findings

01

Natural policy gradient enjoys geometric convergence with increasing stepsizes.

02

The paper proposes stable critic algorithms based on multi-step returns and importance sampling.

03

The paper achieves an overall $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for policy optimization.
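
The first finding can be illustrated with a small tabular sketch. It assumes the standard softmax form of natural policy gradient, where the new policy is proportional to the old policy times `exp(beta * Q)`, and it uses exact Q-values rather than a learned critic. The function names, the doubling stepsize schedule (`growth=2.0`), and the other default parameters are illustrative assumptions, not values taken from the paper. As `beta` grows, the update puts nearly all mass on the greedy action, which is the sense in which the update behaves like approximate policy iteration.

```python
import numpy as np

def q_values(P, R, pi, gamma):
    """Exact Q^pi for a tabular MDP. P: (S, A, S), R: (S, A), pi: (S, A)."""
    S = R.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    r_pi = (pi * R).sum(axis=1)             # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return R + gamma * P @ V                # (S, A)

def npg_step(log_pi, Q, beta):
    """Softmax NPG update in log space: pi_new is proportional to pi * exp(beta * Q)."""
    logits = log_pi + beta * Q
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

def run_npg(P, R, gamma=0.9, iters=30, beta0=1.0, growth=2.0):
    """Run NPG with a geometrically increasing stepsize (assumed schedule)."""
    S, A = R.shape
    log_pi = np.full((S, A), -np.log(A))    # start from the uniform policy
    beta = beta0
    for _ in range(iters):
        Q = q_values(P, R, np.exp(log_pi), gamma)
        log_pi = npg_step(log_pi, Q, beta)
        beta *= growth                       # larger beta -> closer to greedy
    pi = np.exp(log_pi)
    return pi, q_values(P, R, pi, gamma)
```

Keeping the policy in log space avoids taking the logarithm of probabilities that underflow to zero once the stepsize becomes large.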

Abstract

In this work, we consider policy-based methods for solving the reinforcement learning problem, and establish the sample complexity guarantees. A policy-based algorithm typically consists of an actor and a critic. We consider using various policy update rules for the actor, including the celebrated natural policy gradient. In contrast to the gradient ascent approach taken in the literature, we view natural policy gradient as an approximate way of implementing policy iteration, and show that natural policy gradient (without any regularization) enjoys geometric convergence when using increasing stepsizes. As for the critic, we consider using TD-learning with linear function approximation and off-policy sampling. Since it is well-known that in this setting TD-learning can be unstable, we propose a stable generic algorithm (including two specific algorithms: the $\lambda$-averaged $Q$-trace…
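
To make the critic setting concrete, here is a minimal sketch of one multi-step off-policy TD update with linear function approximation and truncated importance sampling ratios. It is a generic update in the style of truncated importance-sampling methods and is not the paper's $\lambda$-averaged $Q$-trace or two-sided $Q$-trace algorithm. The function name, the argument layout, and the truncation level `c_bar` are assumptions made for illustration.

```python
import numpy as np

def n_step_is_td_update(w, phi, traj, pi, b, gamma, alpha, c_bar=1.0):
    """One semi-gradient update of linear Q-weights from an n-step trajectory.

    w:    (d,) weight vector, Q(s, a) is approximated by phi[s, a] @ w
    phi:  (S, A, d) feature table (hypothetical layout)
    traj: list of consecutive (s, a, r, s_next) tuples sampled by behavior policy b
    pi, b: (S, A) target and behavior policies
    """
    q = phi @ w                      # current Q estimates, shape (S, A)
    s0, a0 = traj[0][0], traj[0][1]  # the state-action pair being updated
    correction, trace = 0.0, 1.0
    for i, (s, a, r, s_next) in enumerate(traj):
        if i > 0:
            # Discounted product of truncated importance sampling ratios.
            trace *= gamma * min(c_bar, pi[s, a] / b[s, a])
        # Expected-value TD error under the target policy.
        delta = r + gamma * pi[s_next] @ q[s_next] - q[s, a]
        correction += trace * delta
    return w + alpha * correction * phi[s0, a0]
```

Truncating each ratio at `c_bar` limits the variance of the product of ratios, at the cost of biasing the target when the target and behavior policies differ substantially.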

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research · Advancements in Semiconductor Devices and Circuit Design