Distillation Policy Optimization

Jianfei Ma

arXiv:2302.00533 · cs.LG · September 28, 2023 · 1 citation

TL;DR

This paper introduces a new actor-critic framework that combines on-policy and off-policy data sources, improving sample efficiency and stability in reinforcement learning.

Contribution

The paper presents an actor-critic framework with two variance reduction mechanisms, a unified advantage estimator (UAE) and a residual baseline, that improves sample efficiency and stability and narrows the gap between on-policy and off-policy methods.

Findings

01. Significant improvements in sample efficiency for on-policy algorithms.

02. Effective variance reduction via the unified advantage estimator (UAE) and a residual baseline.

03. Bridges the gap between on-policy stability and off-policy efficiency.

Abstract

While on-policy algorithms are known for their stability, they often demand a substantial number of samples. In contrast, off-policy algorithms, which leverage past experiences, are considered sample-efficient but tend to exhibit instability. Can we develop an algorithm that harnesses the benefits of off-policy data while maintaining stable learning? In this paper, we introduce an actor-critic learning framework that harmonizes two data sources for both evaluation and control, facilitating rapid learning and adaptable integration with on-policy algorithms. This framework incorporates variance reduction mechanisms, including a unified advantage estimator (UAE) and a residual baseline, improving the efficacy of both on- and off-policy learning. Our empirical results showcase substantial enhancements in sample efficiency for on-policy algorithms, effectively bridging the gap to the…
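
The abstract names a unified advantage estimator (UAE) and a residual baseline, but this page does not define either. For orientation only, the sketch below shows generalized advantage estimation (GAE), a standard on-policy estimator that subtracts a learned value baseline to reduce variance. It is not the paper's UAE; the function name `gae_advantages`, its parameters, and the default discount values are illustrative assumptions.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    bootstrap_value: float,
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Generalized advantage estimation over one trajectory segment.

    Assumes no episode terminates inside the segment. `values[t]` is the
    critic's estimate V(s_t) and `bootstrap_value` is V(s_T) for the state
    after the last reward. This is standard GAE, not the paper's UAE.
    """
    advantages = [0.0] * len(rewards)
    next_value = bootstrap_value
    running = 0.0
    # Walk backwards so each step reuses the accumulated future residuals.
    for t in reversed(range(len(rewards))):
        # One-step TD residual: the baseline V(s_t) is subtracted here.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

Setting `lam=0` reduces the estimate to the one-step TD residual (low variance, more bias), while `lam=1` gives the discounted return minus the baseline (less bias, more variance).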

Code & Models

Repositories

magifeeney/dpo (PyTorch, official)

Taxonomy

Topics: Smart Grid Energy Management · Age of Information Optimization · Optimization and Search Problems