Policy Optimization as Online Learning with Mediator Feedback
Alberto Maria Metelli, Matteo Papini, Pierluca D'Oro, and Marcello Restelli

TL;DR
This paper frames policy optimization as an online learning problem with mediator feedback, introducing a new algorithm that leverages additional information to improve regret bounds and sample efficiency in continuous control tasks.
Contribution
The paper proposes RANDOMIST, an algorithm that uses mediator feedback for regret minimization in both finite and compact policy spaces, and supports it with theoretical analysis and empirical validation.
Findings
Achieves constant regret in finite policy spaces under certain conditions
Always attains logarithmic regret when the policy space is finite
Demonstrates superior performance over baseline methods in simulations
Abstract
Policy Optimization (PO) is a widely used approach to address continuous control tasks. In this paper, we introduce the notion of mediator feedback that frames PO as an online learning problem over the policy space. The additional available information, compared to the standard bandit feedback, allows reusing samples generated by one policy to estimate the performance of other policies. Based on this observation, we propose an algorithm, RANDomized-exploration policy Optimization via Multiple Importance Sampling with Truncation (RANDOMIST), for regret minimization in PO, that employs a randomized exploration strategy, differently from the existing optimistic approaches. When the policy space is finite, we show that under certain circumstances, it is possible to achieve constant regret, while always enjoying logarithmic regret. We also derive problem-dependent regret lower bounds. Then,…
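The sample-reuse idea in the abstract can be illustrated with a truncated multiple importance sampling estimator. The sketch below is not taken from the paper: it assumes one-dimensional Gaussian policies with a shared standard deviation, uses the balance-heuristic mixture as the denominator of the importance weight, and clips each weight at a fixed threshold. The names `gaussian_pdf` and `truncated_mis_estimate`, and all parameters, are hypothetical and chosen only for illustration.

```python
import math


def gaussian_pdf(x, mean, std=1.0):
    """Density of a 1-D Gaussian; stands in for a policy density (assumption)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def truncated_mis_estimate(samples, returns, behavior_means, counts,
                           target_mean, threshold, std=1.0):
    """Estimate the expected return of a target policy from samples that were
    generated by other (behavior) policies.

    samples        : all collected samples, pooled across behavior policies
    returns        : observed return for each sample
    behavior_means : mean of each behavior policy (hypothetical parameterization)
    counts         : number of samples drawn from each behavior policy
    threshold      : upper clip on the importance weight (truncation)
    """
    n = sum(counts)
    total = 0.0
    for x, r in zip(samples, returns):
        # Balance heuristic: the denominator is the sample-count-weighted
        # mixture of all behavior densities evaluated at x.
        mixture = sum(c / n * gaussian_pdf(x, m, std)
                      for m, c in zip(behavior_means, counts))
        weight = gaussian_pdf(x, target_mean, std) / mixture
        # Truncation bounds the weight, trading some bias for lower variance.
        total += min(weight, threshold) * r
    return total / n
```

When the target policy is the only behavior policy, every weight equals 1 and the estimate reduces to the plain sample mean of the returns. Lowering `threshold` below 1 in that case scales the estimate down, which makes the bias introduced by truncation visible.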
Taxonomy
Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Machine Learning and Algorithms
