Policy Optimization as Online Learning with Mediator Feedback

Alberto Maria Metelli, Matteo Papini, Pierluca D'Oro, and Marcello Restelli

arXiv:2012.08225 · cs.LG · December 16, 2020

TL;DR

This paper frames policy optimization as an online learning problem with mediator feedback, introducing a new algorithm that leverages additional information to improve regret bounds and sample efficiency in continuous control tasks.

Contribution

The paper proposes RANDOMIST, a randomized-exploration algorithm that uses mediator feedback for regret minimization. It covers both finite and compact policy spaces and is validated theoretically and empirically.

Findings

01. Achieves constant regret in finite policy spaces under certain conditions

02. Always attains logarithmic regret when the policy space is finite

03. Demonstrates superior performance over baseline methods in simulations

Abstract

Policy Optimization (PO) is a widely used approach to address continuous control tasks. In this paper, we introduce the notion of mediator feedback that frames PO as an online learning problem over the policy space. The additional available information, compared to the standard bandit feedback, allows reusing samples generated by one policy to estimate the performance of other policies. Based on this observation, we propose an algorithm, RANDomized-exploration policy Optimization via Multiple Importance Sampling with Truncation (RANDOMIST), for regret minimization in PO, that employs a randomized exploration strategy, differently from the existing optimistic approaches. When the policy space is finite, we show that under certain circumstances, it is possible to achieve constant regret, while always enjoying logarithmic regret. We also derive problem-dependent regret lower bounds. Then,…
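The sample-reuse idea in the abstract can be illustrated with a minimal sketch of a truncated multiple importance sampling estimator. This is not the authors' implementation: the one-dimensional Gaussian outcome distributions, the balance-heuristic mixture weights, the fixed truncation threshold, and the names `gaussian_pdf` and `truncated_mis_estimate` are all assumptions made for illustration.

```python
import numpy as np


def gaussian_pdf(x, mean, std=1.0):
    """Density of N(mean, std^2) evaluated at x (hypothetical outcome model)."""
    z = (x - mean) / std
    return np.exp(-0.5 * z ** 2) / (std * np.sqrt(2.0 * np.pi))


def truncated_mis_estimate(samples, rewards, behavior_means, counts,
                           target_mean, threshold):
    """Estimate the expected reward of a target policy from pooled samples.

    samples        : outcomes drawn from the behavior policies, pooled together
    rewards        : reward observed for each outcome
    behavior_means : mean of each behavior policy's outcome distribution
    counts         : number of samples drawn from each behavior policy
    target_mean    : mean of the target policy's outcome distribution
    threshold      : fixed cap on the importance weights (an assumption;
                     the paper's truncation rule may differ)
    """
    samples = np.asarray(samples, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()

    # Balance heuristic: the denominator is the mixture of all behavior
    # densities, each weighted by its share of the pooled samples.
    mixture = sum((c / n) * gaussian_pdf(samples, m)
                  for c, m in zip(counts, behavior_means))

    # Importance weight of the target policy against the mixture,
    # capped at the threshold to control variance.
    weights = np.minimum(gaussian_pdf(samples, target_mean) / mixture,
                         threshold)
    return float(np.mean(weights * rewards))


# Usage: reuse samples from two behavior policies to evaluate a third.
rng = np.random.default_rng(0)
behavior_means = [-1.0, 1.0]
counts = [500, 500]
samples = np.concatenate([rng.normal(m, 1.0, size=c)
                          for m, c in zip(behavior_means, counts)])
rewards = np.clip(samples, 0.0, 1.0)  # hypothetical bounded reward
estimate = truncated_mis_estimate(samples, rewards, behavior_means, counts,
                                  target_mean=0.0, threshold=5.0)
```

Capping the weights introduces some bias in exchange for lower variance, which is the usual motivation for truncation in importance sampling.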

Taxonomy

Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Machine Learning and Algorithms