Dual Approximation Policy Optimization

Zhihan Xiong; Maryam Fazel; Lin Xiao

arXiv:2410.01249 · cs.LG · October 3, 2024

TL;DR

The paper introduces DAPO, a policy optimization framework that measures function approximation error with the dual Bregman divergence induced by the mirror map. This yields fast linear convergence and recovers several existing methods as special cases.

Contribution

It presents a dual approximation framework for policy mirror descent that guarantees fast linear convergence under general function approximation and unifies several existing methods.

Findings

01

Achieves fast linear convergence with general function approximation.

02

Includes several practical methods as special cases.

03

Immediately provides strong convergence guarantees for the methods it covers as special cases.
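As a point of reference for the linear-convergence claim, the sketch below runs exact policy mirror descent (PMD), the method DAPO extends with function approximation, on a small random tabular MDP. It assumes the negative-entropy mirror map (a KL proximal term) and geometrically increasing step sizes $\eta_k = \eta_0/\gamma^k$, a schedule known from the PMD literature to give linear convergence in the exact setting. It is not DAPO itself: Q-values are computed exactly, there is no function approximation, and the function names, MDP size, and step-size schedule are illustrative assumptions.

```python
import numpy as np


def evaluate(P, R, pi, gamma):
    """Exact V^pi and Q^pi for a tabular MDP (P: S x A x S, R: S x A, pi: S x A)."""
    S = R.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state transitions under pi
    r_pi = np.sum(pi * R, axis=1)  # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = R + gamma * (P @ V)  # (S, A, S) @ (S,) -> (S, A)
    return V, Q


def optimal_value(P, R, gamma, iters=2000):
    """V* by value iteration; the error shrinks like gamma**iters."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        V = np.max(R + gamma * (P @ V), axis=1)
    return V


def softmax_rows(theta):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)


def pmd_kl(P, R, gamma, eta0=1.0, iters=100):
    """Exact PMD with a KL proximal term and step sizes eta_k = eta0 / gamma**k.

    The step argmax_p <eta_k Q(s,.), p> - KL(p || pi_k(.|s)) has the closed form
    pi_{k+1} proportional to pi_k * exp(eta_k Q). We store the logits theta
    (a dual-space variable), add eta_k * Q to them, and map back with softmax.
    """
    S, A = R.shape
    theta = np.zeros((S, A))  # uniform initial policy
    V_star = optimal_value(P, R, gamma)
    gaps = []  # max_s V*(s) - V^{pi_k}(s) for each iterate pi_k
    for k in range(iters):
        pi = softmax_rows(theta)
        V, Q = evaluate(P, R, pi, gamma)
        gaps.append(float(np.max(V_star - V)))
        theta = theta + (eta0 / gamma**k) * Q
    return softmax_rows(theta), gaps


# Usage on a hypothetical 5-state, 3-action random MDP.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)  # normalize into transition probabilities
R = rng.random((S, A))
pi, gaps = pmd_kl(P, R, gamma)
print(f"gap at k=0: {gaps[0]:.3e}, gap at k=99: {gaps[-1]:.3e}")
```

With exact Q-values each step cannot lower the value at any state, so the recorded gaps should be nonincreasing; as $\eta_k$ grows, the update approaches greedy policy iteration.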

Abstract

We propose Dual Approximation Policy Optimization (DAPO), a framework that incorporates general function approximation into policy mirror descent methods. In contrast to the popular approach of using the $L_2$-norm to measure function approximation errors, DAPO uses the dual Bregman divergence induced by the mirror map for policy projection. This duality framework has both theoretical and practical implications: not only does it achieve fast linear convergence with general function approximation, but it also includes several well-known practical methods as special cases, immediately providing strong convergence guarantees.
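The dual Bregman divergence can be made concrete for the most common mirror map. Assuming negative entropy $h(p)=\sum_a p_a\log p_a$ on the simplex, its conjugate is $h^*(y)=\log\sum_a e^{y_a}$ with $\nabla h^* = \mathrm{softmax}$, so logits act as dual variables and $D_{h^*}(y, y') = \mathrm{KL}(\mathrm{softmax}(y')\,\|\,\mathrm{softmax}(y))$. The sketch below checks this identity numerically and contrasts it with the squared $L_2$ distance on logits: shifting every logit by a constant leaves the policy unchanged and gives zero dual Bregman error, but a large $L_2$ error. This is a generic illustration under those assumptions, not the paper's loss; the helper names, example numbers, and the argument order of $D_{h^*}$ are our own choices.

```python
import numpy as np


def logsumexp(y):
    """Convex conjugate h*(y) of negative entropy restricted to the simplex."""
    m = np.max(y)
    return m + np.log(np.sum(np.exp(y - m)))


def softmax(y):
    """Gradient of logsumexp: maps a dual (logit) vector back to the simplex."""
    e = np.exp(y - np.max(y))
    return e / e.sum()


def dual_bregman(y, y_ref):
    """D_{h*}(y, y_ref) = h*(y) - h*(y_ref) - <grad h*(y_ref), y - y_ref>."""
    return float(logsumexp(y) - logsumexp(y_ref) - softmax(y_ref) @ (y - y_ref))


def kl(p, q):
    """KL(p || q) for strictly positive probability vectors."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


# Hypothetical single-state example: one KL-PMD target written in logit space.
pi_k = np.array([0.5, 0.3, 0.2])
Q = np.array([1.0, 0.0, 2.0])
eta = 0.5
y_target = np.log(pi_k) + eta * Q  # exact PMD step before normalization

rng = np.random.default_rng(0)
candidates = {
    "shifted": y_target + 5.0,  # same policy after softmax
    "noisy": y_target + 0.1 * rng.standard_normal(3),  # a different policy
}
for name, y_hat in candidates.items():
    d_dual = dual_bregman(y_hat, y_target)
    d_l2 = float(np.sum((y_hat - y_target) ** 2))
    print(f"{name}: dual Bregman = {d_dual:.2e}, squared L2 = {d_l2:.2e}")
```

Because the dual Bregman error ignores constant shifts of the logits, which the softmax normalization discards anyway, it scores an approximate logit vector only by the policy it induces; the squared $L_2$ error does not have this property.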

Taxonomy

Topics: Optimization and Search Problems

Methods: Dual Approximation Policy Optimization (DAPO)