A Self-Tuning Actor-Critic Algorithm

Tom Zahavy; Zhongwen Xu; Vivek Veeriah; Matteo Hessel; Junhyuk Oh; Hado van Hasselt; David Silver; Satinder Singh

arXiv:2002.12928 · stat.ML · April 15, 2021 · 32 citations

TL;DR

This paper introduces STAC, a reinforcement learning algorithm that automatically adapts hyperparameters online using meta-gradient descent, leading to improved performance across various benchmarks without significant computational overhead.
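
To make the idea of adapting a hyperparameter online by meta-gradient descent concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption made for illustration: the quadratic inner loss, the regularization coefficient `sigmoid(eta)`, the names `inner_grad`, `meta_step`, and `run`, and the step sizes. It is not STAC's loss or training loop, and the meta-gradient is derived by hand for one inner step rather than computed with automatic differentiation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def inner_grad(theta, eta, a):
    # Gradient w.r.t. theta of the toy inner loss
    #   0.5 * (theta - a)**2 + 0.5 * sigmoid(eta) * theta**2
    # where sigmoid(eta) plays the role of a tunable loss coefficient.
    return (theta - a) + sigmoid(eta) * theta

def meta_step(theta, eta, a, b, lr=0.5, meta_lr=5.0):
    # Inner update of the parameter under the current hyperparameter.
    theta_new = theta - lr * inner_grad(theta, eta, a)
    # One-step sensitivity of the updated parameter to eta,
    # treating the pre-update theta as a constant.
    s = sigmoid(eta)
    dtheta_deta = -lr * s * (1.0 - s) * theta
    # Outer (meta) loss 0.5 * (theta_new - b)**2, chain rule through theta_new.
    meta_grad = (theta_new - b) * dtheta_deta
    eta_new = eta - meta_lr * meta_grad
    return theta_new, eta_new

def run(steps=2000, a=1.0, b=0.8):
    # a: target seen by the inner loss; b: target seen by the meta loss.
    theta, eta = 0.0, 0.0
    for _ in range(steps):
        theta, eta = meta_step(theta, eta, a, b)
    return theta, eta

if __name__ == "__main__":
    theta, eta = run()
    print(theta, sigmoid(eta))
```

In this toy setting the hyperparameter is adjusted alongside the parameter in a single run, so no outer search over fixed hyperparameter values is needed. The inner loss alone would settle at `a / (1 + sigmoid(eta))`; the meta update moves `eta` until that point matches the meta target `b`.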

Contribution

The paper presents a novel self-tuning actor-critic algorithm that automatically adapts all differentiable hyperparameters, discovers auxiliary tasks, and enhances off-policy learning with a new leaky V-trace operator.
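
The summary above names the leaky V-trace operator without defining it. The sketch below assumes one plausible reading: each importance weight is a convex mixture of the truncated ratio used by V-trace and the untruncated ratio, so a mixing value of 1 recovers standard V-trace and 0 recovers plain importance sampling. That form, the function names, and the parameters `alpha_rho` and `alpha_c` are assumptions for illustration and should be checked against the paper.

```python
def leaky_weight(is_ratio, clip, alpha):
    # Assumed leaky form: mix the truncated and untruncated importance ratio.
    # alpha = 1.0 gives min(clip, is_ratio); alpha = 0.0 gives is_ratio.
    return alpha * min(clip, is_ratio) + (1.0 - alpha) * is_ratio

def leaky_vtrace_targets(rewards, values, bootstrap_value, is_ratios,
                         gamma=0.99, rho_bar=1.0, c_bar=1.0,
                         alpha_rho=1.0, alpha_c=1.0):
    # rewards[t], values[t], is_ratios[t] describe step t of one trajectory;
    # bootstrap_value is the value estimate of the state after the last step.
    n = len(rewards)
    next_values = list(values[1:]) + [bootstrap_value]
    targets = [0.0] * n
    acc = 0.0  # running value of v_{t+1} - V(x_{t+1})
    for t in reversed(range(n)):
        rho = leaky_weight(is_ratios[t], rho_bar, alpha_rho)
        c = leaky_weight(is_ratios[t], c_bar, alpha_c)
        # Importance-weighted temporal-difference error.
        delta = rho * (rewards[t] + gamma * next_values[t] - values[t])
        # Backward recursion: v_t - V(x_t) = delta_t + gamma * c_t * (v_{t+1} - V(x_{t+1})).
        acc = delta + gamma * c * acc
        targets[t] = values[t] + acc
    return targets

if __name__ == "__main__":
    print(leaky_vtrace_targets([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0,
                               [1.0, 1.0, 1.0], gamma=0.5))
```

With all importance ratios equal to 1 (on-policy data) the mixing values have no effect and the targets reduce to discounted n-step returns, which gives a quick sanity check for the recursion.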

Findings

01. STAC improved the median human-normalized score on Atari (200M steps) from 243% to 364%.

02. STAC increased the mean score on DM Control from 217 to 389 when learning from features.

03. STAC improved the mean score on DM Control from 108 to 202 when learning from pixels.
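
For readers unfamiliar with the Atari metric, the sketch below shows how a median human-normalized score is conventionally computed: each game's score is rescaled so that a random policy maps to 0% and a human reference maps to 100%, and the median is taken across games. The game names and raw scores are hypothetical placeholders, not results from the paper.

```python
from statistics import median

def human_normalized(agent, random, human):
    # Rescale so a random policy scores 0% and the human reference scores 100%.
    return 100.0 * (agent - random) / (human - random)

def median_human_normalized(scores):
    # scores maps game name -> (agent, random, human) raw scores.
    return median(human_normalized(*s) for s in scores.values())

if __name__ == "__main__":
    # Hypothetical placeholder values for illustration only.
    example = {
        "game_a": (150.0, 50.0, 100.0),
        "game_b": (30.0, 10.0, 110.0),
        "game_c": (75.0, 0.0, 50.0),
    }
    print(median_human_normalized(example))
```

The median is less sensitive than the mean to a few games where the agent far exceeds the human reference, which is why it is commonly reported for Atari.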

Abstract

Reinforcement learning algorithms are highly sensitive to the choice of hyperparameters, typically requiring significant manual effort to identify hyperparameters that perform well on a new domain. In this paper, we take a step towards addressing this issue by automatically adapting hyperparameters online via meta-gradient descent (Xu et al., 2018). We apply our algorithm, Self-Tuning Actor-Critic (STAC), to self-tune all the differentiable hyperparameters of an actor-critic loss function, to discover auxiliary tasks, and to improve off-policy learning using a novel leaky V-trace operator. STAC is simple to use, sample efficient, and does not require a significant increase in compute. Ablative studies show that the overall performance of STAC improves as we adapt more hyperparameters. When applied to the Arcade Learning Environment (Bellemare et al., 2012), STAC…

Videos

#49 - Meta-Gradients in RL - Dr. Tom Zahavy (DeepMind) · YouTube

Taxonomy

Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning

Methods: Sigmoid Activation · Tanh Activation · Experience Replay · Entropy Regularization · Residual Connection · Gradient Clipping · RMSProp · Max Pooling · Convolution