Policy Gradient with Active Importance Sampling
Matteo Papini, Giorgio Manganini, Alberto Maria Metelli, Marcello Restelli

TL;DR
This paper introduces an active importance sampling approach for policy gradient methods in reinforcement learning, optimizing behavioral policies to minimize variance and improve learning efficiency.
Contribution
It proposes an iterative algorithm that optimizes behavioral policies for variance reduction using defensive importance sampling, with theoretical convergence analysis and practical validation.
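For readers unfamiliar with the term, defensive importance sampling in its standard form from the statistics literature can be written as below. The symbols are assumptions chosen for illustration (target density p, candidate behavioral density q, mixing coefficient β, integrand f); the paper's own construction over policies and trajectories may differ in its details.

```latex
\[
q_\beta(x) = \beta\, p(x) + (1-\beta)\, q(x), \qquad \beta \in (0,1],
\]
\[
\hat{\mu}_N = \frac{1}{N}\sum_{i=1}^{N} w_\beta(x_i)\, f(x_i),
\qquad x_i \sim q_\beta,
\qquad w_\beta(x) = \frac{p(x)}{q_\beta(x)} \le \frac{1}{\beta}.
\]
```

Mixing in a fraction β of the target keeps every importance weight bounded by 1/β, which is what guards the estimator against the heavy-tailed weights that an aggressive behavioral distribution can otherwise produce.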
Findings
Reduced policy gradient variance compared to standard methods
Faster learning speed in reinforcement learning tasks
Theoretical convergence rate of the proposed algorithm
Abstract
Importance sampling (IS) is a fundamental technique underlying a wide range of off-policy reinforcement learning (RL) approaches. Policy gradient (PG) methods, in particular, benefit significantly from IS, which enables the effective reuse of previously collected samples and thus increases sample efficiency. Classically, however, IS is employed in RL as a passive tool for re-weighting historical samples. The statistical community, in contrast, employs IS as an active tool, combined with behavioral distributions that can reduce the variance of the estimate even below that of the sample mean. In this paper, we focus on this second setting by addressing the behavioral policy optimization (BPO) problem. We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance as much as possible. We provide an iterative algorithm that alternates between the…
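The claim that an actively chosen behavioral distribution can beat the plain sample mean is easy to see on a toy problem. The following is a minimal sketch, not the paper's algorithm: it estimates a one-dimensional rare-event probability rather than a policy gradient, and the behavioral distribution is fixed by hand rather than optimized. All names (`normal_pdf`, `mc_estimate`, `defensive_is_estimate`) and parameter values (`beta=0.1`, `shift=3.0`) are hypothetical choices made for illustration.

```python
import numpy as np


def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a normal distribution, evaluated elementwise."""
    z = (x - mean) / std
    return np.exp(-0.5 * z**2) / (std * np.sqrt(2.0 * np.pi))


def f(x):
    """Integrand: indicator of the rare event x > 3."""
    return (x > 3.0).astype(float)


def mc_estimate(rng, n):
    """Plain sample mean, drawing directly from the target N(0, 1)."""
    x = rng.normal(0.0, 1.0, size=n)
    return f(x).mean()


def defensive_is_estimate(rng, n, beta=0.1, shift=3.0):
    """Defensive IS: sample from the mixture beta*p + (1-beta)*q.

    p is the target N(0, 1); q = N(shift, 1) is a hand-picked behavioral
    density concentrated on the rare event (an assumption of this sketch).
    """
    from_target = rng.random(n) < beta
    x = np.where(
        from_target,
        rng.normal(0.0, 1.0, size=n),
        rng.normal(shift, 1.0, size=n),
    )
    p = normal_pdf(x)
    q_mix = beta * p + (1.0 - beta) * normal_pdf(x, mean=shift)
    w = p / q_mix  # importance weights, bounded above by 1 / beta
    return (w * f(x)).mean(), w.max()


rng = np.random.default_rng(0)
n, runs, beta = 1000, 200, 0.1

# Repeat both estimators many times to compare their empirical variance.
mc = np.array([mc_estimate(rng, n) for _ in range(runs)])
results = [defensive_is_estimate(rng, n, beta=beta) for _ in range(runs)]
dis = np.array([r[0] for r in results])
max_weight = max(r[1] for r in results)

mc_var = mc.var()
dis_var = dis.var()
print(f"sample-mean variance:  {mc_var:.3e}")
print(f"defensive IS variance: {dis_var:.3e}")
print(f"largest weight seen:   {max_weight:.3f} (bound 1/beta = {1 / beta:.1f})")
```

Both estimators are unbiased for the same quantity, but the sample mean rarely observes the event at all, while the mixture places most of its samples where the integrand is nonzero and corrects for this with bounded weights. The paper's setting replaces the hand-picked `q` with a behavioral policy that is itself optimized.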
Taxonomy
Topics: Statistical Methods and Inference · Simulation Techniques and Applications
