Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation
Yunhao Tang, Tadashi Kozuno, Mark Rowland, Rémi Munos, Michal Valko

TL;DR
This paper introduces a unified framework, based on off-policy evaluation, for estimating higher-order derivatives of value functions in meta-reinforcement learning. The framework clarifies the bias-variance trade-off of Hessian estimates and yields new estimators that are easy to implement with auto-differentiation libraries.
Contribution
The paper interprets a number of existing Hessian estimation methods as special cases of one framework and proposes a new family of estimators that leads to performance gains in practice.
Findings
Framework clarifies bias-variance trade-offs in Hessian estimates
New estimators are easily implemented with auto-differentiation
Performance gains demonstrated in meta-reinforcement learning tasks
Abstract
Model-agnostic meta-reinforcement learning requires estimating the Hessian matrix of value functions. This is challenging from an implementation perspective, as repeatedly differentiating policy gradient estimates may lead to biased Hessian estimates. In this work, we provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation. Our framework interprets a number of prior approaches as special cases and elucidates the bias and variance trade-off of Hessian estimates. This framework also opens the door to a new family of estimates, which can be easily implemented with auto-differentiation libraries, and lead to performance gains in practice.
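The bias the abstract mentions can be seen in a toy setting. The sketch below is not the paper's estimator; it assumes a hypothetical one-step problem with three actions, a softmax policy, and made-up rewards, and it computes expectations exactly so that no sampling noise is involved. It compares two surrogate objectives under auto-differentiation: the usual log-probability surrogate, whose gradient is correct but whose second derivative drops a term, and an importance-ratio surrogate (the simplest off-policy evaluation estimate, with the behaviour policy held fixed via `stop_gradient`), whose second derivative matches the true Hessian.

```python
import jax
import jax.numpy as jnp

# Hypothetical one-step problem: three actions with fixed rewards (made-up values).
rewards = jnp.array([1.0, 0.0, -0.5])


def policy(theta):
    """Softmax policy over the three actions."""
    return jax.nn.softmax(theta)


def value(theta):
    """Exact value J(theta) = sum_a pi_theta(a) r(a)."""
    return jnp.dot(policy(theta), rewards)


def log_prob_surrogate(theta):
    """Policy-gradient surrogate E_{a ~ mu}[log pi_theta(a) r(a)], with mu = pi_theta held fixed."""
    behaviour = jax.lax.stop_gradient(policy(theta))
    return jnp.sum(behaviour * jax.nn.log_softmax(theta) * rewards)


def ratio_surrogate(theta):
    """Importance-sampling surrogate E_{a ~ mu}[(pi_theta(a) / mu(a)) r(a)], with mu held fixed."""
    pi = policy(theta)
    behaviour = jax.lax.stop_gradient(pi)
    return jnp.sum(behaviour * (pi / behaviour) * rewards)


theta = jnp.array([0.2, -0.1, 0.4])  # arbitrary parameter value

true_grad = jax.grad(value)(theta)
true_hess = jax.hessian(value)(theta)

grad_logp = jax.grad(log_prob_surrogate)(theta)
grad_ratio = jax.grad(ratio_surrogate)(theta)
hess_logp = jax.hessian(log_prob_surrogate)(theta)
hess_ratio = jax.hessian(ratio_surrogate)(theta)

# Both surrogates recover the gradient; only the ratio form recovers the Hessian.
print("max gradient error (log-prob):", jnp.max(jnp.abs(grad_logp - true_grad)))
print("max gradient error (ratio):   ", jnp.max(jnp.abs(grad_ratio - true_grad)))
print("max Hessian error (log-prob): ", jnp.max(jnp.abs(hess_logp - true_hess)))
print("max Hessian error (ratio):    ", jnp.max(jnp.abs(hess_ratio - true_hess)))
```

In this toy case the log-probability surrogate misses the outer-product term sum_a pi(a) r(a) grad log pi(a) grad log pi(a)^T, which is why differentiating it a second time gives a biased Hessian even though its first derivative is correct.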
Taxonomy
Topics: Reinforcement Learning in Robotics · Fuel Cells and Related Materials · Adversarial Robustness in Machine Learning
